> Performance bugs are out there, lurking and unseen
Is it a bug if it's unseen? Code that runs slower than it could in theory is literally everywhere. But is it worth spending the time to fix it if you aren't experiencing any delays?
I love the way Michael Abrash talks about optimizing. He emphasizes and re-emphasizes that optimization is for people not computers, and that if you aren't experiencing a delay, you're done.
"In other words, high-performance code should ideally run so fast that any further improvement in the code would be pointless.
"Notice that the above definition most emphatically does not say anything about making the software as fast as possible. It also does not say anything about using assembly language, or an optimizing compiler, or, for that matter, a compiler at all. It also doesn’t say anything about how the code was designed and written. What it does say is that high-performance code shouldn’t get in the user’s way—and that’s all."
1. It's a user facing bug if I have to purchase a bigger laptop because someone decided to get the smallest element of a list by typing list.sort()[0].
2. It's a user facing bug if the use of my app heats up their phone, drains the battery, and leaves someone stranded without a way to safely get home on a Friday night.
3. It's a business facing bug if my AWS bill is in the hundreds of thousands per month when I could be paying a mere 10s of thousands. (This is more common than you'd think)
4. It's a user facing bug if a sudden increase in traffic (due to my product being featured somewhere) takes down my services and results in lost sales and brand damage.
I understand your point; it's not necessarily a problem if a backend feed is taking 45s instead of 30s, or if the text input subroutine for an interactive app allocates.
As far as I can tell as a software developer and daily computer user, people aren't erring on the side of making their software too fast.
100% agree with all of these, and you did what the author didn't and added criteria that affect people. If these happen, then make the code faster!
Abrash is a well known games/graphics programmer, and part of his point is that if your game is already running at 60fps, there's no point in replacing list.sort()[0] with something else. No milliseconds saved will help anyone in that case. (Though yes, as many pointed out, power consumption may be an important metric).
Personally, I'd say if your backend feed is taking 45s and could easily take 30s, then you should fix it. :)
> people aren't erring on the side of making their software too fast
:) 100% agree... too fast isn't a problem we have. But, over-engineering is an exceedingly widespread problem, and that's where I'm coming from. If there's a user-focused metric that shows a problem, yes fix it. If there isn't a user-focused metric, then think twice. The article didn't advocate a user-focused metric as far as I could tell.
In The Practice of Programming Kernighan and Pike say that one should estimate how much time can be saved over the expected runs of the program and allocate no more than that much time to optimization.
If you're doing something once, or once per year, that takes 45 seconds rather than 30, then don't bother. If you're doing it once every hour, that's six minutes a day, 365 days a year. If it's on a minutely cron on a hundred systems, that's 144,000 runs a day, over two million seconds of runtime. If a few hours of work can save you millions of seconds of runtime every day, that's a clear winner, assuming other higher-priority things are done. If it's run minutely by thousands of customers on hundreds of thousands of servers...
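A back-of-the-envelope check of that arithmetic, using the assumed inputs from the example above (15 seconds saved per run, a minutely cron, a hundred systems):

    #include <cstdio>

    int main() {
        const double seconds_saved_per_run = 45.0 - 30.0;  // assumed from the example
        const int runs_per_day_per_system = 24 * 60;       // minutely cron
        const int systems = 100;

        const double saved_per_day =
            seconds_saved_per_run * runs_per_day_per_system * systems;
        std::printf("%.0f seconds of runtime saved per day (~%.0f machine-days)\n",
                    saved_per_day, saved_per_day / 86400.0);
        return 0;
    }

That works out to about 2.16 million seconds, roughly 25 machine-days of runtime, saved every day.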
> if your game is already running at 60fps, there's no point in replacing list.sort()[0]
If the game runs on a device with a battery then reducing that load is a reason. You already implied that though.
I suspect hard-to-quantify improvements in quality are not what you are trying to rail against. If the software is going to be used widely or for a long duration, hard-to-quantify improvements can easily save millions in the long run.
There are over-engineering efforts where people try to shoehorn in bad design strategies, over-abstract an already well-abstracted design, or refactor well-working code to be more "testable" when it is already tested. All of those should be opposed, but how do you draw the line between over-engineering and hard-to-quantify quality increases?
> how do you draw the line between over-engineering and hard-to-quantify quality increases?
That is a really good question. That is the question, IMO. I don't know the answer, and if I did I'm pretty sure I could make a lot of money doing it. It's a balance; maybe the right thing to do is just to make sure you have forces in both directions, and don't let one force dominate for too long. Make sure in your working groups there are always people asking if there's a user-facing metric, a real tangible reason or justification for the code you're planning to write or optimize. And also make sure there are people who are good at planning for the future and are good at coming up with user-facing justifications for upcoming plans.
I think that was true back when he wrote the words, but less so now. Even if you're not noticing a delay, code that completes quicker is generally code that uses less power, and you can still notice the effect on your phone's battery life, or on the size of your monthly bill from your data center.
Great point. Efficiency comes in different shapes. To respond, I'd say power & cost are human-oriented metrics, and I take his point to be: use a human-oriented metric. For me it's as true now as it ever was.
> I love the way Michael Abrash talks about optimizing. He emphasizes and re-emphasizes that optimization is for people not computers, and that if you aren't experiencing a delay, you're done.
In a time of battery-powered computers, CPU cycles spent equate to energy consumed, which equates to reduced battery life.
Even if your program does not impact the user's experience of runtime behaviour (latencies, delays), it still impacts the user's experience of their device's recharge cycle.
Also, spinning fans generate noise, which is just as much of a nuisance as a laggy program.
A performance issue that is hard to detect at 10 rps becomes an issue when you're trying to push 1000 rps. A performance issue in a mobile app (or a desktop app on a laptop) might be undetectable from a responsiveness or frame rate point of view, but burn through your battery.
If you count memory footprint as a performance issue, those don't become apparent until you try to run more things concurrently, and find yourself thrashing when memory should be plentiful.
Yes, true, I agree with you. And performance becomes seen when you scale, or measure power consumption and battery life, or check your memory footprint.
I think what you're saying is congruent with what Michael Abrash is saying, that you address it if it matters. And sometimes it matters.
But almost all the code in the entire world can run faster if someone spends more time optimizing. Anything in any scripting language could be compiled and further optimized. Anything in a compiled language can be further improved, use less memory, access cache faster, etc. etc.
Should it be optimized just because it can? Should it be optimized if it doesn't matter under any human oriented metric, including power consumption? (The author didn't mention power consumption or memory footprint.)
Sometimes it doesn't matter, and looking for it when it doesn't matter is what we call premature optimization. Premature optimization is also a waste of resources, including time and money, and a huge factor in why almost all software arrives late and over budget.
Performance issues cost you money: either you're losing users due to bounce rate (there are studies out there that having pages load for more than 300-500ms or so will significantly increase your bounce rate), or you're paying huge prices for your cloud hosting or you're burning battery of your users and thus losing them as well.
The issue is that usually smart design avoids significant performance issues, and design which does not consider performance at all ("we'll optimize it later") usually has to be scrapped and rewritten due to how many systemic issues accumulate over time. When you hit a scaling wall, it's usually orders of magnitude more expensive to rewrite the application than spending a bit of time getting the basics right before you started.
Totally agree with you. Those sound like measurable human-facing impacts that can and should be optimized.
FWIW, I've also personally watched millions of dollars get wasted on over-engineering, that's part of the context behind my comment. I'm pretty sure I've seen more money lost on problems that don't exist than I've seen lost on avoiding problems that do exist. (Of course, that's impossible to prove, and I could be wrong, but that's my personal feeling.)
And in no way do I advocate not considering performance at all. I've done a lot of high performance code, and I think it matters a lot. But I like that Abrash attaches it to something meaningful, not make it faster for the sake of making it faster. The author didn't demonstrate any meaningful metrics like you did, he only said you should make it faster because you can.
Oh yes of course... over-engineering is as dangerous as bad engineering. Finding the proper balance IMO is one of the most important skills of senior engineers. Sadly I've also seen way too many people misunderstand the "don't optimize prematurely" tip and completely ignore any performance considerations when designing software in the early stages. Fixing that hurts.
A performance issue that isn't an issue when doing a single thing can become huge when you're trying to automate doing it many thousands of times. On the other hand, there are few better ways to obfuscate a code base than trying to optimize performance.
The rule of thumb is don't do anything stupid, but don't particularly worry about efficiency until there is a real need. And then, as Knuth advised, you start with a profiler because the performance problem probably isn't where you think.
>"In other words, high-performance code should ideally run so fast that any further improvement in the code would be pointless.
And then define "pointless": to make a quantified decision on optimization you should use a performance model. A common one is the roofline model. https://en.wikipedia.org/wiki/Roofline_model
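As a rough illustration of how a roofline estimate turns "pointless" into a number (the machine figures below are made-up placeholders, not measurements):

    #include <algorithm>
    #include <cstdio>

    // Attainable throughput is capped by either peak compute or by memory
    // bandwidth times the kernel's arithmetic intensity (FLOPs per byte moved).
    double roofline_gflops(double peak_gflops, double mem_bw_gbps,
                           double flops_per_byte) {
        return std::min(peak_gflops, mem_bw_gbps * flops_per_byte);
    }

    int main() {
        const double peak = 500.0;  // GFLOP/s, assumed
        const double bw = 50.0;     // GB/s, assumed
        // A kernel doing 0.25 FLOPs per byte is memory-bound (12.5 GFLOP/s here);
        // one doing 20 FLOPs per byte is compute-bound (capped at 500 GFLOP/s).
        std::printf("%.1f\n", roofline_gflops(peak, bw, 0.25));
        std::printf("%.1f\n", roofline_gflops(peak, bw, 20.0));
        return 0;
    }

If your code is already near its roof, further micro-optimization really is pointless; if it's far below, you have a quantified target.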
But I will never forget once having to work with some Windows software written by someone else. He decided to parse a CSV file by reading the whole thing into memory, and then literally chopping off one character at a time, copying the rest of the string each time. It made what should be an instantaneous import of a fairly small file into a 20-minute ordeal. Simply reading in one line at a time would have been a massive speedup.
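A reconstruction of that kind of parser (not the actual code, which was never published) shows why it hurts: chopping characters off the front of a string copies everything that remains on every step, so an N-character file costs O(N^2) work.

    #include <string>
    #include <vector>

    // The slow shape: consume the file one character at a time off the front.
    std::vector<std::string> parse_slow(std::string contents) {
        std::vector<std::string> fields;
        std::string current;
        while (!contents.empty()) {
            char c = contents[0];
            contents = contents.substr(1);  // copies everything that's left
            if (c == ',' || c == '\n') {
                fields.push_back(current);
                current.clear();
            } else {
                current += c;
            }
        }
        if (!current.empty()) fields.push_back(current);
        return fields;
    }

    // Walking the buffer in place (or reading line by line with std::getline)
    // does the same job in O(N).
    std::vector<std::string> parse_fast(const std::string& contents) {
        std::vector<std::string> fields;
        std::string current;
        for (char c : contents) {
            if (c == ',' || c == '\n') {
                fields.push_back(current);
                current.clear();
            } else {
                current += c;
            }
        }
        if (!current.empty()) fields.push_back(current);
        return fields;
    }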
I did not have access to change it. I asked him to fix it. I begged him to fix it. His response was to shrug and say, "Who cares? This runs as part of a batch process with nobody in front of it. It works fine." That may be, but I cared because I had to sit through it repeatedly while debugging certain misbehavior that arose in said batch process.
Years later I was talking to an ex-coworker who stayed at said company after I left. He reported that that original developer eventually had to make a fix to said batch process. While doing so he fixed the performance bug and was an instant hero among a number of other people who had to sit through the same import repeatedly over time.
Yeah... it turns out that he had made the fix I'd originally asked for. And nobody realized that he had caused the problem originally AND refused to fix it.
That's one developer that I'm glad I will never have to work with again...
This reminds me of a story I heard about a lead game programmer who made a habit of pre-allocating an unused megabyte at the beginning of a project. The last week before ship, when the game would crash out of memory often and everyone was freaking out and trying to squeeze every last byte out of their code, he'd comment out the allocation nobody had noticed, free up a megabyte, and be the hero that made the game fit.
But in the version that I heard, the lead programmer didn't do it to be a hero. He did that because he knew that it is easier to not waste memory up front than it is to figure out where you can save it at the end. Therefore he made sure that memory became an obvious problem early on so that people were careful about it the rest of the way through development.
In other words, because he did that, the game hit memory targets that it never could have hit had people waited until they tried to squeeze everything in at the last moment and only then found that they were out of memory.
> The fly-by-wire flight software for the Saab Gripen (a lightweight fighter) went a step further... Control flow went forward only. Sometimes one piece of code had to leave a note for a later piece telling it what to do, but this worked out well for testing: all data was allocated statically, and monitoring those variables gave a clear picture of most everything the software was doing.... No bug has ever been found in the “released for flight” versions of that code.
When I started writing Arduino code, I independently re-invented this technique. I write my code with only fixed-size loops and no delay() calls (basically cooperative multitasking), making heavy use of finite-state machines to keep track of what should happen when.
This style of code takes some getting used to, but its advantages are enormous: I see other people using delays and busy-loops and then struggling to get interrupts working to keep their device responsive, whereas I have a clean linear set of logic that can be easily understood and is free of strange race conditions and the like, and is guaranteed to keep responding to events.
This style also makes it easy to literally eyeball performance: attach a bicolor LED (one that has e.g. red and green LEDs in the same package) and tell it to toggle colors once per loop. If the main loop is taking more than 8 milliseconds or so the LED will start to flicker visibly; if something pauses the LED will go solid and it's obvious the main loop is frozen.
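Not the poster's actual code, but a minimal sketch of the style as I understand it: no delay() in the main loop, a millis()-based state machine, and a heartbeat toggled once per pass (a single LED stands in for the bicolor one described above; pins and timings are made up).

    const int HEARTBEAT_PIN = 13;
    const int VALVE_PIN = 7;
    const int BUTTON_PIN = 2;

    enum State { IDLE, DISPENSING, COOLDOWN };
    State state = IDLE;
    unsigned long stateEnteredAt = 0;
    bool heartbeat = false;

    void setup() {
      pinMode(HEARTBEAT_PIN, OUTPUT);
      pinMode(VALVE_PIN, OUTPUT);
      pinMode(BUTTON_PIN, INPUT_PULLUP);  // assumed: active-low button on pin 2
    }

    void loop() {
      unsigned long now = millis();

      // Cooperative "tasks": each case does a small amount of work and returns.
      switch (state) {
        case IDLE:
          if (digitalRead(BUTTON_PIN) == LOW) {  // button pressed
            digitalWrite(VALVE_PIN, HIGH);
            state = DISPENSING;
            stateEnteredAt = now;
          }
          break;
        case DISPENSING:
          if (now - stateEnteredAt >= 500) {     // run the output for 500 ms
            digitalWrite(VALVE_PIN, LOW);
            state = COOLDOWN;
            stateEnteredAt = now;
          }
          break;
        case COOLDOWN:
          if (now - stateEnteredAt >= 2000) {    // ignore the button for 2 s
            state = IDLE;
          }
          break;
      }

      // Heartbeat: toggled every pass. While the loop is fast the blinking is
      // too quick to see; if the loop slows past ~8 ms it flickers visibly,
      // and if something blocks, the LED goes solid.
      heartbeat = !heartbeat;
      digitalWrite(HEARTBEAT_PIN, heartbeat ? HIGH : LOW);
    }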
One of the interesting side effects of this style is that you sometimes choose to let code run slower than it could, and choose to use more memory than you need.
The point is (as you say) to make all loops & allocation as predictable as possible, so when you make them fixed size, you have to size them for the worst case. Ideally there is no worst case, and the amount of work you do per frame or cycle is constant. But for anything non-trivial, chances are fixed allocations & loops aren't always needed, even though they're extremely useful.
This does exactly what you say, in my experience: it makes performance eyeball-able, you can see where problems are. Code becomes so easy to reason about that debugging is easier and time is saved.
I would like to see more of this style of code (specifically in the Arduino setting), as I'm having trouble seeing what you're up to from here. Is there some place I can read more - ideally, actual code?
Most of the code I've written either can't be released for privacy reasons (I don't feel like scrubbing out the info of the person I wrote it for) or is weird one-off experimental code that's a pain to explain.
However, I did find one standalone project that illustrates the point quite elegantly; I've just written up some documentation and posted it on https://github.com/wolfgang42/iDropper-arduino
This code cheats a little bit in that it has some delay loops in it; this is because the timings are finicky and it was a lot easier to write this way. Notice, however, that the `idrop_loop()` function that drives everything still specifies a four-millisecond maximum runtime, far under anything that I've ever needed to worry about.
Let me know if you have any questions about this; I feel that some of the explanation I wrote isn't quite as clear as it should be, and because the code is so minimal (it's really more of a library than a full project) it doesn't illustrate the point quite as well as I'd like.
I don't have any examples to share, but I'll add that the thinking behind this style is sometimes codified by style guides for projects that require high levels of code safety. The best example I can think of is the famous "Power of Ten" rules for code safety published by NASA. Follow these rules, and the byproduct will be the kind of code we're talking about.
Thanks for sharing the link. In JavaScript you can put functions inside functions, so if a function is only used once you can put it where it's used, which would have the same effect as inlining it. But you also get to have your cake and eat it, since you keep the advantages of using functions.
I like the advice of setting a breakpoint and then stepping through all the code for, say, a single frame. If what actually happens is way off from how the code is structured, your team will have problems comprehending what the code does, which will make it very hard to work with.
std::list is a linked list, so it makes sense that size() is linear. Only if you store the size separately do you get constant behavior (which C++11 enforces).
Your friend probably could've sped his code up even more by using a std::vector. Linked lists have very bad processor cache behavior (they're not allocated in contiguous memory), and there are not a lot of things a list does better than a vector.
Only if you prepend a lot, or insert in the middle of the list, might it have some advantages over a vector, and even then there are better alternatives.
I have timed it on several implementations and posted the results (mostly on reddit.com/r/cpp), but until you get several thousand items std::vector is just faster than std::list. Of course it varies from machine to machine; for example, my 4th-gen i7 could insert about 8,000 items before list finally matched the speed of inserting at the front of a std::vector, and my AMD 8150 (the Bulldozer thing, way before Ryzen) was about 3,500.
I did similar tests comparing linear scans through a std::vector against lookups in a std::map or std::unordered_map, with similar results: many thousands of items are required before dumb vectors get slow.
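For reference, a rough sketch of how one might run that kind of timing (this isn't the poster's benchmark; crossover points will differ by machine, compiler flags, and element type, so build with optimizations on and measure):

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <list>
    #include <vector>

    // Time n front-insertions into a container of int.
    template <typename Container>
    static double time_front_inserts(int n) {
        auto t0 = std::chrono::steady_clock::now();
        Container c;
        for (int i = 0; i < n; ++i) {
            c.insert(c.begin(), i);  // push to the front
        }
        auto t1 = std::chrono::steady_clock::now();
        // Use the container so the work can't be optimized away entirely.
        if (c.size() != static_cast<std::size_t>(n)) std::abort();
        return std::chrono::duration<double, std::micro>(t1 - t0).count();
    }

    int main() {
        for (int n : {100, 1000, 10000}) {
            std::printf("n=%5d  vector: %9.1f us   list: %9.1f us\n", n,
                        time_front_inserts<std::vector<int>>(n),
                        time_front_inserts<std::list<int>>(n));
        }
        return 0;
    }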
My default rule is to use a std::array if the size is fixed or a std::vector if the size changes, until I hit 1,000 items. Then I look at the complexity of the operations I am doing and choose which containers to benchmark. Then I throw a couple of microbenchmarks in the test suite and use whatever that tells me will be faster in my case. Sometimes I even go so far as to use ifdefs and typedefs to change the container for different builds.
In practice I use a lot of std::vectors and I sort them a lot. Sorted vectors are very fast in the general case. In the few workloads I have with large numbers of items, even with millions of items, I use std::sort and std::binary_search with std::vector because they can be much faster than other associative containers. You do need to use operator< intelligently and that means either already passing in an overload for the comparator function or making a simple container class that wraps these concepts the way that makes sense for your workload.
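A sketch of that pattern with a made-up record type, using std::lower_bound so the lookup also hands back the element:

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct Account {
        std::uint64_t id;
        std::string name;
    };

    // One comparator used both for sorting and for the binary search.
    static bool by_id(const Account& a, const Account& b) { return a.id < b.id; }

    // Sort once after loading (or keep the vector sorted as you insert)...
    void index_accounts(std::vector<Account>& accounts) {
        std::sort(accounts.begin(), accounts.end(), by_id);
    }

    // ...then each lookup is an O(log n) binary search over contiguous memory.
    const Account* find_account(const std::vector<Account>& accounts,
                                std::uint64_t id) {
        const Account key{id, {}};
        auto it = std::lower_bound(accounts.begin(), accounts.end(), key, by_id);
        return (it != accounts.end() && it->id == id) ? &*it : nullptr;
    }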
Amazing how often just replacing a std::vector with a gap-buffer vector implementation will get you the benefits of vector storage combined with good insert/delete behavior (presuming locality of such operations, which is often the case, especially when doing something that scans through the array).
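For illustration, a minimal gap-buffer sketch (not production code): the elements stay in one contiguous array with an empty "gap" parked at the cursor, so inserts and deletes at the cursor are cheap, and moving the cursor just shifts elements across the gap.

    #include <cstddef>
    #include <vector>

    template <typename T>
    class GapBuffer {
        std::vector<T> buf_;
        std::size_t gap_begin_ = 0;  // first empty slot
        std::size_t gap_end_ = 0;    // one past the last empty slot
    public:
        explicit GapBuffer(std::size_t capacity = 64)
            : buf_(capacity), gap_end_(capacity) {}

        std::size_t size() const { return buf_.size() - (gap_end_ - gap_begin_); }

        // Move the cursor to logical position pos (0 <= pos <= size()).
        void move_cursor(std::size_t pos) {
            while (gap_begin_ > pos) buf_[--gap_end_] = buf_[--gap_begin_];
            while (gap_begin_ < pos) buf_[gap_begin_++] = buf_[gap_end_++];
        }

        void insert(const T& value) {            // insert at the cursor
            if (gap_begin_ == gap_end_) grow();
            buf_[gap_begin_++] = value;
        }

        void erase_before_cursor() {             // like backspace
            if (gap_begin_ > 0) --gap_begin_;
        }

        const T& operator[](std::size_t i) const {
            return i < gap_begin_ ? buf_[i] : buf_[i + (gap_end_ - gap_begin_)];
        }

    private:
        void grow() {
            const std::size_t old_cap = buf_.size();
            const std::size_t tail = old_cap - gap_end_;
            buf_.resize(old_cap * 2);
            // Move the elements after the gap to the end of the enlarged buffer.
            for (std::size_t i = 0; i < tail; ++i)
                buf_[buf_.size() - 1 - i] = buf_[old_cap - 1 - i];
            gap_end_ = buf_.size() - tail;
        }
    };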
I once had some code that I managed to speed up around 90%. It was returning a massive data structure that another function would then operate on. I sped it up by removing the data structure entirely, and instead adding a callback that would be called for each element of the data structure.
It's not always possible to do, and half the time it doesn't even help, but every now and then it's a good way to optimize.
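A toy illustration of that shape change (not the code in question):

    #include <cstdio>
    #include <functional>
    #include <vector>

    // Old shape: build and return a big container that the caller walks once.
    std::vector<int> squares_upto(int n) {
        std::vector<int> out;
        out.reserve(n);
        for (int i = 0; i < n; ++i) out.push_back(i * i);
        return out;
    }

    // New shape: hand each element to a callback as it's produced, so the
    // intermediate container (and its allocations) never exists.
    void squares_upto(int n, const std::function<void(int)>& fn) {
        for (int i = 0; i < n; ++i) fn(i * i);
    }

    int main() {
        long long sum = 0;
        squares_upto(1000, [&](int v) { sum += v; });
        std::printf("%lld\n", sum);
        return 0;
    }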
One recent example of this is Synapse, the Matrix messaging server.
Synapse was generally considered to be inefficient and a huge memory hog, so they have been experimenting with various ways to rewrite it in Go, saying[1] at the end of last year:
"Synapse is in a relatively stable state currently ... we’re starting to hit some fundamental limitations of the architecture: ... python’s single-threadedness and memory inefficiency; ...; the fact the app papers over SQL problems by caching everything in RAM (resulting in synapse’s high RAM requirements); ..."
But in January[2] it was discovered, during the investigation of another problem, that the vast majority of the memory usage was due to a bug:
"It’s also revealed the root cause of why Synapse’s RAM usage is quite so bad – it turns out that it actually idles at around 200MB with default caching, but there’s a particular codepath which causes it to spike temporarily by 1GB or so – and that RAM is then not released back to the OS."
I wonder if they would still have started the Go rewrite if it hadn't been for that bug. (It seems the new implementation has a better architecture and is substantially faster, so I guess it's still a good idea, though maybe not as urgent as before?)
This is a great example of why you can't take perf-related issues intuitively or at face value. Profiling and in-depth analysis are always needed. I wonder if this could have been spotted earlier if a better analysis had been done.
We would absolutely still be doing the Go rewrite: while fixing that bug improved the memory usage, it didn't help actual performance much. Meanwhile Dendrite (the rewrite) is currently benchmarking two orders of magnitude faster than Synapse. Unfortunately Synapse's architecture simply has many design inefficiencies :)
His particular example was with the standard C library function strcat where each new invocation traverses the enlarged string over and over from the beginning, in order to find the terminating NUL character and append the new part.
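For concreteness, a minimal sketch of the two shapes (dst is assumed to be large enough in both):

    #include <cstring>
    #include <string>
    #include <vector>

    // The pattern described: strcat re-scans dst from the start on every call
    // to find the terminating NUL, so N appends cost O(N^2) overall.
    void join_with_strcat(char* dst, const std::vector<std::string>& parts) {
        dst[0] = '\0';
        for (const auto& p : parts) {
            std::strcat(dst, p.c_str());   // walks the whole string every time
        }
    }

    // Remembering where the string currently ends makes the same job O(N).
    void join_with_end_pointer(char* dst, const std::vector<std::string>& parts) {
        char* end = dst;
        *end = '\0';
        for (const auto& p : parts) {
            std::memcpy(end, p.c_str(), p.size() + 1);  // copy including the NUL
            end += p.size();
        }
    }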
I define a special category of performance bugs for distributed systems. The fact that proper distributed tracing is not standard yet (especially outside of cloud environments) is a real pain here. The Google white paper on Dapper [0] and all its implementations [1][2][3] in one form or another are appearing, but they usually target web environments. That means that for my legacy stack, with multiple nodes connected via networks of different kinds, I have to profile on multiple systems, aggregate different log types manually, think about time sync (not everything logs UTC...) and so on. To be honest, we just reached the point where we're trying to get a European subsidiary project rolling to address this problem. The time lost tracking down unexplainable gaps between a request and the first result displayed in the app is weeks, and on multiple occasions months, just because we have no easy way to profile our distributed ecosystem.
After identifying who is to blame: network, db nodes, processing nodes, application logic, cache (invalidation); we can talk about fixing.
For fixing of course it would be great if you could extract a meta model of the application and another of the hardware. Changes to those models and some nice prediction of impact would be the ideal end goal. Or, if we are wishing anyway, some nice DSE on those models to optimize would be even better.
I know. But our stack is C#, WCF, and WebSockets, with C++ where performance is required, and some REST API/WPF/HTML5/TypeScript/JavaScript UIs/clients.
The stack is something grown the past 15 or so years (medical business). The existing options are not mature enough yet for us to be used, or not available yet on our platforms.
Occasionally I like the idea of producing some sort of dynamic instrumentation code (I privately think of it as "shitgrind") that detects dynamic code that is doing redundant work and flags it. Of course, this doesn't tell you that the code is useless (i.e. I might have to initialize something that subsequently isn't used down some deep and not-anticipatable-at-compile-time set of branches), but it might provide hints that I'm doing a lot of writes that never match up with reads, or that the writes/calculations that I'm doing are going into places that already have that value. It would be incredibly slow but there are plenty of uses I can imagine where you could detect performance problems in a toy version of the problem (maybe running 100-1000x slower) that are the same problems that you would have in the real version of the problem at full scale. Would be a fun project, IMO.
Even though I agree that >>insert method of std::map is roughly 7 times slower than it should be<< is bad, these kinds of problems are not too hard to find and solve if they are actually problematic for your software.
The most problematic performance issues I've come across were usually bad/premature optimizations that were not (correctly) validated against a simpler implementation as a performance baseline. Things like parallelism (multi-threading, web workers) or caching can absolutely tank performance if not done correctly. Plus they usually tend to make stuff more complex and bug-prone.
This is one of the few areas where adding more levels of abstraction cannot help you fix the problem. It's remarkably easy for some library function or low level of the program to have O(n) performance that ends up getting called in a loop with the same input or output.
> This is one of the few areas where adding more levels of abstraction cannot help you fix the problem.
I'm not sure that's true. It is of course true that adding levels of abstraction often degrades performance, but I don't think it's necessary.
In fact, programming languages themselves are abstractions. It's pretty well agreed at this point that a good C compiler will generate faster code than someone writing assembly. It's of course possible to write code that's at least as fast in assembly, but in practice, nobody would do it, because the program would be unmaintainable.
You can make abstractions fast, and one of the benefits of doing so is that you can make your abstraction be the easy way of doing things. That means, by picking the easiest solutions, the people on your team are also using an efficient solution.
> It's pretty well agreed at this point that a good C compiler will generate faster code than someone writing assembly.
We must be discussing with different people then. Have you ever looked at what a compiler spits out? My rule is that a compiler is as good as a half-decent assembly coder. I won't even go into the limitations of compilers.
> It's of course possible to write code that's at least as fast in assembly, but in practice, nobody would do it, because the program would be unmaintainable.
You don't write the whole program in assembly. But you can write it with it still being maintainable if you use just a little restraint and forward planning. See the Fresh IDE and the libraries that go with it [0].
If you want to write performant C, you should focus on writing simple code. Being smart, especially with needless abstractions, will more often than not result in slower programs. The "hello, I am a compiler" crap was never true and won't be in the near future.
How many die-hard assembly programmers today would do a better job in register allocation, instruction scheduling based on the latency of each individual instruction, instruction selection to pick the instructions with the minimum encoded length, avoiding pipeline hazards or doing inlining decisions? These are hard optimization problems that require some real CPU juice to get right. An additional bonus is that the compiler does these really quickly for you and you don't have to worry about the instruction scheduling each time you refactor the code, for example.
Funny thing is that one of the bigger problems with optimizing compilers is register allocation. Not only is it difficult to write a good algorithm for it, but compilers don't always know which variable is most important to keep in a register. GCC even implemented a weaker allocator [0] as the older one was too complex to decipher problems with.
Instruction scheduling is not a problem on modern CPUs. Not that it's hard either way: just reorder things around memory accesses and heavy instructions like division. Agner Fog made a nice table of instruction latencies and throughputs [1].
Instruction encoding length... do you use 16-bit integers or do you just int everything? The compiler has to do what you tell it to do. Sure, there are plenty of tricks around that, but those are not the ones that give the compiler a remarkable advantage over humans. Tricks like using bitwise operations instead of multiplication and shortening equations are the advantage the compiler has. But even that is not that big of a deal. (Again, do you use the restrict keyword? Or static/define for constants? Do you know that C treats every "compilation unit" (.c file) as a separate program?)
I'm not sure it's an entirely apt comparison to put inefficient assembly programmers up against efficient C programmers. It may be much more difficult to get a skilled assembly programmer than it is to get a skilled C programmer, and the expectations of the assembly programmer may indeed be higher on a day-to-day basis, but saying C is faster than assembly because "Coder X" can write faster code in C doesn't really work for me.
> How many die-hard assembly programmers today would do a better job...
No idea. But I've done some assembly myself (as a console game programmer) and generally speaking, you don't have to think about any of the issues you raised to get faster code than what the compiler creates. There are some corner cases occasionally, but not very often. Does any of this mean I tend to choose assembly? No way.
> the compiler does these really quickly for you
It sounds like you generally agree with what I said? ;)
Depends upon the compiler. Might have been true in the past that compilers made worse code than a human. But today? I think compilers only get better at this. They would be hard to beat.
The console compilers (especially Nintendo's) do tend to be somewhat old & somewhat crappy. I haven't done game code in a couple of years, but for the decade that I did, the compilers on all platforms were 5-10 years behind current.
That said, there are reasons that compilers often make slower code than humans, reasons that may always exist. The primary reason to use a compiler isn't to get faster code, it's to use a programming language abstraction and get "fast enough" code. The abstractions are (generally speaking) what add layers that are hard for a compiler to optimize compared to a person who's not using abstractions. There's a long list of commonly used features that compilers have a hard time optimizing: virtual functions, the threat of aliasing, floating point pipelining, etc. Compilers are required to make different choices than humans writing assembly would, because of the abstractions that we choose to use in higher level languages.
Lots of people agree that it's not hard to make assembly faster than an optimizing C compiler.
Yet, my very first C++ program was to declare an object that contained a byte, had a constructor to initialize that byte, and a method to return it. My 'main' declared an instance initialized with 'true' and the code simply returned the accessor method's value.
The generated code? "mov ax, 255; ret"
That was pretty good optimization, seeing completely through the abstraction and flattening it to the simplest possible implementation. I was impressed.
Anyway there're lots of further abstractions these days, harder to see through. I imagine a clever human, with domain knowledge, can write a loop to do what a high-level library used to do.
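Something like this, reconstructed from the description (details assumed):

    struct Flag {
        explicit Flag(bool v) : value(v) {}
        bool get() const { return value; }
    private:
        bool value;
    };

    int main() {
        Flag f(true);
        // With optimizations on, current compilers see through the constructor
        // and the accessor and collapse this to returning a constant.
        return f.get();
    }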
This is a great example of something compilers definitely do better than humans writing assembly.
Inlining code, especially with multiple passes, is something humans don't do. I still modularize in assembly and write functions, factor code used in multiple places into a single spot. The compiler is free to expand the same function inline a thousand places, and then use the expanded code to collapse to fewer instructions.
The flattening can definitely be cool to see. So in your example, what would you have done if you'd written the assembly yourself? Do you feel like the compiler did something you wouldn't have?
Many programmers would have declared the object, set its members, and written main to fetch and return the data member. Not unreasonable, since humans tend to write abstractions pretty much as they're stated.
There's not so much difference if you're writing C code, inspecting the resulting assembly, and tweaking the compiler flags and your data structures to get it to produce the assembly you want.
But if you're not aware of what the compiler is emitting, it's often not hard to pull an order of magnitude improvement out of the code.
I write assembly professionally, and I am significantly better at it than any compiler I've used. The considerations you list are all actually pretty easy to account for, but also mostly not very important for most code sequences on modern architectures(!)
- Most performance-critical code is regular enough to stay far away from the NP-hard corners of register allocation. In the rare instances when I need to do careful hand-allocation to get an inner loop to fit perfectly, the compiler usually gives up and spills several variables.
- Performance-critical code is rarely latency bound, and modern processors are wildly out-of-order, so scheduling based on latency is usually a waste of time (and sometimes a pessimization). Minimizing the critical path length of loop-carried dependencies and balancing port utilization are far more important factors in most code sequences.
When code is latency-bound, it's often due to bad design decisions further up the stack; refactoring those to shift the concern to throughput is vastly more profitable than anything you can do to reduce latency locally, and is the sort of optimization that a compiler simply can't do (and usually can't even tell you you should do). I can tell the guy calling my BLAS library that he should try to batch up many vectors and call GEMM instead of GEMV (see the sketch after this list). The compiler usually doesn't do that, and no matter how hard it works at optimizing GEMV, it's going to be bound by memory access.
- Encoding length is a factor for whole-program optimization (reducing i-cache pressure), but critical sections often fit into the decoded µop cache / loop buffer / inner cache / etc making this a non-issue. Sometimes you want to choose a particular encoding to optimize for e.g. fetch group alignment or to separate branches to aid the predictor, but there you're usually not trying to minimize the length but to get alignment without no-ops instead (and compilers generally don't do this optimization, because they don't have alignment/instruction size information until really late in the process).
- Avoiding pipeline hazards and getting inlining right are both worthwhile, but really, really straightforward for the most part. (Inlining is hard to get right if you're trying to avoid wasted work in the name of compiling as fast as you can, but humans don't have that restriction.)
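To make the earlier GEMV-vs-GEMM point concrete, a sketch against the CBLAS interface (layouts and batching scheme are illustrative, assuming a CBLAS implementation such as OpenBLAS is linked):

    #include <cblas.h>
    #include <cstddef>
    #include <vector>

    // One matrix-vector product per input vector: every call re-reads A.
    void apply_one_at_a_time(const double* A, int m, int k,
                             const std::vector<const double*>& xs,
                             const std::vector<double*>& ys) {
        for (std::size_t i = 0; i < xs.size(); ++i) {
            // y_i = A * x_i   (A is m x k, row-major, lda = k)
            cblas_dgemv(CblasRowMajor, CblasNoTrans, m, k,
                        1.0, A, k, xs[i], 1, 0.0, ys[i], 1);
        }
    }

    // Batched: stack the n input vectors as columns of X (k x n) and do one
    // GEMM, Y = A * X, which lets the BLAS block for cache and reuse A.
    void apply_batched(const double* A, int m, int k,
                       const double* X, double* Y, int n) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A, k, X, n, 0.0, Y, n);
    }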
The compiler works a lot faster than I do, but I'm a lot better at balancing these and other considerations and methodically optimizing a non-trivial code sequence until it's nearly optimal. I might take a week to optimize the same function that the compiler optimizes in a millisecond, but my code is faster. For most code, the compiler version is fine. For heavily-used system library functions, you definitely want my version instead.
How can we improve compilers so that they're as good as I am? Compilers today mainly work in terms of local transforms on a computation graph, which necessarily cannot capture long-range optimizations that are available to me; most expert assembly programmers I know work mainly in terms of a global resource utilization minimization approach. There's no real reason why we can't make compilers that work that way, but it will almost surely make them significantly slower, so it would be more of a super-optimizer mode that you would only want to run once.
I can also make inferences about what the programmer was "really" trying to do, and go down the hall to ask him. The compiler doesn't (usually) do that; there's a lot of opportunity for compilers to say "hey, I noticed that if x always has this common property, I can make your code much faster -- is that a valid assumption here?"
Compilers aren't bad, but there's a lot of room for improvement, still.
You should just write compilers. I wish processors would remove some features to save power, such as OoO execution, and defer that work to the compiler's instruction scheduler.
Static scheduling has been tried in VLIW architectures, such as Itanium. You got assembly that is harder to understand, a compiler that is hard to write, and, finally, worse performance.
Essentially you also lose the advantage of runtime optimization; compare a plain Java compiler with a Java JIT compiler.
That's a bit of an oversimplification; static scheduling is used in extremely low-power architectures (The ARM Cortex-M series, for example), as well as GPUs and DSPs. It has its uses. However, the evidence is pretty compelling that it's not a great fit for "conventional" CPUs doing latency-sensitive tasks.
It's a good fit for low-power single-issue architectures and DSPs, because ... well, there's only trivial scheduling to be done because the machine is so dead simple. It's a good fit for GPUs because they are primarily throughput oriented, and can make scheduling issues "go away" by running lots of threads / warps / whatever marketing wants to call them to hide latency.
Technically the compiler will have more information about the program than the processor will ever have. And vice versa: the compiler cannot know (or only with great difficulty) more than the processor about runtime information. If the performance is at steady state, then I don't see why a static instruction scheduler wouldn't work, given sound methodologies.
The biggest problem is that static scheduling requires recompiling your program for every micro-architecture. Anytime you change the latency of one instruction, or add an execution unit, etc, you need to recompile in order to get close to optimal performance. That's OK for some uses (Hello Gentoo Users and Other Genuine Humans), but an enormous hassle for most use cases -- the upshot is that most statically-scheduled code is non-optimal most of the time, and OoO eats its lunch.
The second biggest problem is that cache residency is often not statically predictable, which means that you generally have no idea in what order the various memory dependencies will actually be satisfied, which makes all your effort on scheduling approximately useless.
Those older Atoms were pretty universally terrible, by the way.
Something like a tracing JIT within a limited window of instructions could be the solution. Agreed that dynamism must be part of the solution for real data. Whether it's done in a JIT or in hardware, there needs to be some feedback mechanism, i.e. the Performance Monitoring Unit on Intel x86.
While it's not my day-to-day job, I've been an occasional contributor to llvm and clang for almost a decade (though most of my more significant contributions have been in the form of behind-the-scenes conversations with folks who are more regular contributors).
I think you're going to be disappointed how few of those are handled for you by most compilers, unless you're using the Intel compiler and targeting a specific CPU.
It's true that libraries can cause performance issues, but it's not a given. Using template metaprogramming libraries (like STL and Boost in C++) can result in clean, high-level code which has very good performance. Often the "loops" and other high-level constructs are optimized right out of the program.
Also, link-time and whole-program optimization can remedy some performance issues across modules/libraries. [edit: not algorithmic problems, of course, LTO isn't going to rewrite your algorithms for you.]
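A trivial instance of that point:

    #include <numeric>
    #include <vector>

    // The "abstraction" here is std::accumulate; with optimizations enabled,
    // compilers typically inline it into the same (often vectorized) loop you
    // would have written by hand, so the high-level call costs nothing extra.
    double sum(const std::vector<double>& xs) {
        return std::accumulate(xs.begin(), xs.end(), 0.0);
    }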
It does, in some situations, create a visibility issue. This is very common in Java: using some higher-level library like CXF, for example, for HTTPS connectivity to services. Unknown to you, and somewhat difficult to unearth (many layers down the rabbit hole to find the actual socket code, not in the CXF code), is that connection pooling doesn't work well, and every connection spawns more than one underlying object.
Modern hardware is incredibly complex. It’s nearly impossible to reliably predict the performance outcome of the changes. Personally, when I work on optimizing performance of my code, I throw away like 25% of my changes after profiling.
Good architecture with well-defined levels of abstractions makes it easier to change algorithms, to replace implementation of algorithms with different ones, and so on. Even though abstractions are not always free performance-wise, often they do help fixing performance problems, albeit not directly.
Ex post facto adding, no. But if the usage of the function corresponds to a concept within your codebase, and if you were able to abstract the usage of that function away, it would obviously help you fix the problem.
Abstraction in software almost always involves using something that does more things than you actually need; of course it's not going to help with performance. Sometimes it means calculating related results one at a time, when they would be better calculated at the same time in a single pass, because the library to do the calculation can do either calculation on an array, but only one at a time.
I'm open sourcing a Java Selenium library with page performance assertions built in as part of my startup. It will let you do things like:
test.config.setSLAMaxTime(2000); // set time in MS
assertSLAMet(); // add this to any page
It will also allow you to assert on specific URLs and window timings. You can then use these assertions to pass/fail builds. I also have some other nifty ideas on preventing performance regressions I'm building into this (smart benchmarking vs. previous runs).
If anyone is interested, I'll be announcing it here when we open source it:
> What if your Selenium tests had performance asserts for CI/CD
Isn't this just... a single function? Like, how the heck do you build a startup around that. We do this as part of our Python selenium tests, I didn't realize it wasn't built in to be honest.
There's a lot more to it--this is just a tiny part we're giving away, not something we'll make money on.
We do real-browser load testing in the cloud using the same scripts folks use for integration testing. Because they can also run locally, you can run them alongside your integration tests with each commit, and you always know you have working scripts. Scripts are easy to code and maintain because they can use page objects like functional tests.
This means you can have versioned load test/monitoring scripts in your project that always work with a particular sha. When you deploy, you can start transactional monitoring immediately. It also means load testing is not a separate stage in your development lifecycle.
It's only a bug if the stakeholders that surround the program agree that there should be (or already is) in place a performance requirement which is not being met.
Opportunities to uncover performance that could be improved and impose requirements against it are out there, lurking unseen. Only problem is you have to convince people to care on a case-by-case basis.
At one point I was working on a utility to detect suboptimal sequences of method calls; the canonical example is using "[1,2,3].select {|x| x > 1 }.first" rather than "[1,2,3].detect {|x| x > 1}". These can be performance issues, although the bigger win is in readability and communicating the developer's intent. More details and examples here if you're interested:
I haven't worked on it much recently though because the problems it found weren't that significant. But I like the idea of runtime analysis for finding issues, especially in a dynamically-typed language.
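For what it's worth, the same shape of anti-pattern transposed into C++ (purely illustrative):

    #include <algorithm>
    #include <iterator>
    #include <vector>

    // Filter everything, then keep only the first match (all that work and
    // memory for one element)...
    int first_over_one_slow(const std::vector<int>& xs) {
        std::vector<int> matches;
        std::copy_if(xs.begin(), xs.end(), std::back_inserter(matches),
                     [](int x) { return x > 1; });
        return matches.front();  // assumes at least one match exists
    }

    // ...versus stopping at the first match.
    int first_over_one(const std::vector<int>& xs) {
        return *std::find_if(xs.begin(), xs.end(),
                             [](int x) { return x > 1; });  // same assumption
    }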
An internet commenter points out the obvious -- that performance matters in computing. Hackernews nods in agreement, then goes back to coding Ruby on Rails applications in an editor implemented in HTML5 and JavaScript that consumes 13% CPU just to make the cursor blink.
"It takes an experienced programmer, with a reasonably accurate mental model of the problem and the correct solution, to know how fast the operation should have been performed, and hence if the program is running slower than it should be."
I love this sort of approach to performance. If you know the hardware, the software, and the application, you can predict how fast it should go. That's surprisingly rare, in my experience, sadly. About the time I wrote lmbench I was sitting in a lot of meetings at Sun where people were saying "we should do this" and I could do a mental estimate of how fast it would go (I memorized most of the stuff that lmbench measured so I know how many packets/sec we could do, how many iops, how much memory bandwidth we had, etc). You'd be amazed at the number of times "architects" were proposing to build something that couldn't possibly work.
It's pretty systems-y and not that common in programmers, what with today's frameworks and all. But there is still some value in being able to predict how fast something should run. I love it when I see people doing that with the needed knowledge to be accurate, very fun to watch.
Performance sort of has a paradoxical "coffee test": something that takes a few seconds is aggravating and you will wait for it but something that routinely takes 5 minutes will make you go get coffee or switch tasks completely.
With enough alternate tasks, waits are absorbed. Oddly, overall throughput can be worse with sporadic tiny delays that are technically indicative of a fast task.
You don't just want fast operations, you want programs that can be smart about gathering unavoidable delays into chunks. This can even pay dividends when consuming resources, as the gathered steps may use less (e.g. laptop wakes dormant hardware just once to perform several grouped tasks, instead of repeatedly over a short time frame).
In practice, the majority of web app performance has more to do with the latency of IO. People seem to forget that 4ms may seem fast to humans, but it's glacial to computers. Add 100 4ms calls to the DB and all of a sudden your code is waiting for nearly half a second if it does nothing else.
Bonus points that in a real real-time system (anything with a GUI) you might have only a millisecond or less to handle an event in a bigger stream, e.g. dragging. This is why so much UI is, or should be, really asynchronous.
We run into this a fair bit of the time with Hyperscan. One particularly noxious anti-pattern that can arise is the "prefilter that isn't" anti-pattern: specifically, a multiple stage pipeline where the initial stage is meant to filter out some portion of false positives when feeding to a more expensive check, but isn't doing its job.
The painful thing is that the expensive check downstream winds up doing everything, but the code still "works" because the expensive check is the only necessary part (well, as long as the pre-filter isn't totally hosed and isn't producing false negatives).
Oh boy, it becomes far worse when you are dealing with JavaScript and web apps. Pretty much all modern-day "best practices" are not best practices at all, but bullet holes leaking performance.
I spent about a year researching this and put together my findings into this (too short) 20 minute tech talk: https://youtu.be/BEqH-oZ4UXI hopefully others will find it useful too.
Or just never use JS ;) but it is possible to make it fast.
The fun begins when you need to find performance issues in production, where you cannot really use a profiler. Then you need to jump to APM tools. Unfortunately it seems there's nothing really free at that level, but take this for example:
https://www.dynatrace.com/blog/code-level-visibility-for-nod...
In my experience, it's very difficult to tie profiling data from generic profilers to specific requests, then to the specific lines-of-code triggering the problems.
This is important because many performance conditions don't reveal themselves all of the time: for example, it's very common that an issue might only be a problem for your largest customers. The context is really important.
> when I was working on the Doom 3 BFG edition release, the exactly predicted off-by-one-frame-of-latency input sampling happened, and very nearly shipped
What is this predicted off-by-one-frame-of-latency input sampling ?
If performance matters, benchmarking the heck out of everything helps quite a bit! I like the style of benchmarks that focus on comparing iterations per second.
Personal blog authors - if you insist on not using Medium/similar big box blogging service, just host your blog as flat files on S3. It will never, ever go down when you hit number 1 on HN (as this poor author's site has).
An alternative is to cache requests at the web server level so that successive requests are no heavier than serving a static page. Caching for just a few seconds (some people call it microcaching [1]) will let your blog survive without issues.