Is it a bug if it's unseen? Code that runs slower than it could in theory is literally everywhere. But is it worth paying the cost of the time to fix it, if you aren't experiencing any delays?
I love the way Michael Abrash talks about optimizing. He emphasizes and re-emphasizes that optimization is for people not computers, and that if you aren't experiencing a delay, you're done.
"In other words, high-performance code should ideally run so fast that any further improvement in the code would be pointless.
"Notice that the above definition most emphatically does not say anything about making the software as fast as possible. It also does not say anything about using assembly language, or an optimizing compiler, or, for that matter, a compiler at all. It also doesn’t say anything about how the code was designed and written. What it does say is that high-performance code shouldn’t get in the user’s way—and that’s all."
1. It's a user facing bug if I have to purchase a bigger laptop because someone decided to get the smallest element of a list by typing list.sort().
2. It's a user facing bug if the use of my app heats up their phone, drains the battery, and leaves someone stranded without a way to get home safely on a Friday night.
3. It's a business facing bug if my AWS bill is in the hundreds of thousands per month when I could be paying a mere 10s of thousands. (This is more common than you'd think)
4. It's a user facing bug if a sudden increase in traffic (due to my product being featured somewhere) takes down my services and results in lost sales and brand damage.
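The first bug above is the classic accidental O(n log n)-for-O(n) swap. A hedged C++ sketch of the difference, assuming the goal really is just the smallest element:

```cpp
#include <algorithm>
#include <vector>

// Wasteful: sorts a copy of the whole container, O(n log n),
// just to read the first element afterwards.
int smallest_via_sort(std::vector<int> xs) {
    std::sort(xs.begin(), xs.end());
    return xs.front();
}

// Direct: a single O(n) scan, no copy, no reordering.
int smallest_via_scan(const std::vector<int>& xs) {
    return *std::min_element(xs.begin(), xs.end());
}
```

Both return the same answer; only the work done differs, and on a large list the difference is what you end up buying a bigger laptop for.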
I understand your point, it's not necessarily a problem if a backend feed is taking 45s instead of 30s, or if the text input subroutine for an interactive app allocates.
As far as I can tell as a software developer and daily computer user, people aren't erring on the side of making their software too fast.
Abrash is a well known games/graphics programmer, and part of his point is that if your game is already running at 60fps, there's no point in replacing list.sort() with something else. No milliseconds saved will help anyone in that case. (Though yes, as many pointed out, power consumption may be an important metric).
Personally, I'd say if your backend feed is taking 45s and could easily take 30s, then you should fix it. :)
> people aren't erring on the side of making their software too fast
:) 100% agree... too fast isn't a problem we have. But, over-engineering is an exceedingly widespread problem, and that's where I'm coming from. If there's a user-focused metric that shows a problem, yes fix it. If there isn't a user-focused metric, then think twice. The article didn't advocate a user-focused metric as far as I could tell.
If you're doing something once, or once per year, that takes 45 seconds rather than 30, don't bother. If you're doing it once every hour, that's six minutes a day, 365 days a year. If it's on a minutely cron across a hundred systems, that's 144,000 runs a day, each saving 15 seconds. If a few hours of work can save you that much runtime every day, that's a clear winner, assuming other higher-priority things are done. If it's run minutely by thousands of customers on hundreds of thousands of servers...
If the game runs on a device with a battery then reducing that load is a reason. You already implied that though.
I suspect hard-to-quantify improvements in quality are not what you are trying to rail against. If the software is going to be used widely or for a long duration, hard-to-quantify improvements can easily save millions in the long run.
There are over-engineering efforts where people try to shoehorn in bad design strategies, over-abstract an already well-abstracted design, or refactor well-working code to be more "testable" when it is already tested. All of those should be opposed, but how do you draw the line between over-engineering and hard-to-quantify quality increases?
That is a really good question. That is the question, IMO. I don't know the answer, and if I did I'm pretty sure I could make a lot of money doing it. It's a balance, maybe the right thing to do is just make sure you have forces in both directions, and don't let one force dominate for too long. Make sure in your working groups there are always people asking if there's a user facing metric, a real tangible reason or justification for the code you're planning to write or optimize. And also make sure there are people who are good at planning for the future and are good at coming up with user-facing justifications for upcoming plans.
In a time of battery powered computers CPU cycles spent equate to energy consumed equates to reduction in battery life.
Even if your program does not impact the user's experience of runtime behaviour (latencies, delays), it still impacts the user's experience through their device's recharge cycle.
Also, spinning fans generate noise, which is just as much of a nuisance as a laggy program.
A performance issue that is hard to detect at 10 rps becomes an issue when you're trying to push 1000 rps. A performance issue in a mobile app (or a desktop app on a laptop) might be undetectable from a responsiveness or frame rate point of view, but burn through your battery.
If you count memory footprint as a performance issue, those don't become apparent until you try to run more things concurrently, and find yourself thrashing when memory should be plentiful.
I think what you're saying is congruent with what Michael Abrash is saying, that you address it if it matters. And sometimes it matters.
But almost all the code in the entire world can run faster if someone spends more time optimizing. Anything in any scripting language could be compiled and further optimized. Anything in a compiled language can be further improved, use less memory, access cache faster, etc. etc.
Should it be optimized just because it can? Should it be optimized if it doesn't matter under any human oriented metric, including power consumption? (The author didn't mention power consumption or memory footprint.)
Sometimes it doesn't matter, and looking for it when it doesn't matter is what we call premature optimization. Premature optimization is also a waste of resources, including time and money, and a huge factor in why almost all software arrives late and over budget.
The issue is that smart design usually avoids significant performance issues, while design which does not consider performance at all ("we'll optimize it later") usually has to be scrapped and rewritten because of how many systemic issues accumulate over time. When you hit a scaling wall, it's usually orders of magnitude more expensive to rewrite the application than it would have been to spend a bit of time getting the basics right before you started.
FWIW, I've also personally watched millions of dollars get wasted on over-engineering, that's part of the context behind my comment. I'm pretty sure I've seen more money lost on problems that don't exist than I've seen lost on avoiding problems that do exist. (Of course, that's impossible to prove, and I could be wrong, but that's my personal feeling.)
And in no way do I advocate not considering performance at all. I've done a lot of high performance code, and I think it matters a lot. But I like that Abrash attaches it to something meaningful, not make it faster for the sake of making it faster. The author didn't demonstrate any meaningful metrics like you did, he only said you should make it faster because you can.
A performance issue that isn't an issue when doing a single thing can become huge when you're trying to automate doing it many thousands of times. On the other hand there are few ways which are better to obfuscate a code base than to try to optimize performance.
The rule of thumb is don't do anything stupid, but don't particularly worry about efficiency until there is a real need. And then, as Knuth advised, you start with a profiler because the performance problem probably isn't where you think.
And then define "pointless": to make a quantified decision on optimization you should use a performance model. A common one is the roofline model. https://en.wikipedia.org/wiki/Roofline_model
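The roofline bound itself is just a min() of two ceilings; a tiny sketch (the numbers in the usage note are made up for illustration):

```cpp
#include <algorithm>

// Roofline model: attainable throughput is capped either by peak compute
// or by how fast memory can feed the ALUs at a given arithmetic intensity
// (FLOPs performed per byte moved).
double roofline_gflops(double peak_gflops, double mem_bw_gbs,
                       double flops_per_byte) {
    return std::min(peak_gflops, mem_bw_gbs * flops_per_byte);
}
```

For example, a kernel doing 0.25 FLOPs/byte on a machine with 100 GFLOP/s peak and 50 GB/s of bandwidth is memory-bound at 12.5 GFLOP/s; past that point, any optimization that doesn't raise the kernel's arithmetic intensity is "pointless" by definition.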
It's not a bug, it's a feature. Gives you time to get coffee while the page loads.
But I will never forget once having to work with some Windows software written by someone else. He decided to parse a CSV file by reading the whole thing into memory, and then literally chopping off one character at a time, copying the rest of the string each time. It made what should be an instantaneous import of a fairly small file into a 20-minute ordeal. Simply reading in one line at a time would have been a massive speedup.
I did not have access to change it. I asked him to fix it. I begged him to fix it. His response was to shrug and say, "Who cares? This runs as part of a batch process with nobody in front of it. It works fine." That may be, but I cared because I had to sit through it repeatedly while debugging certain misbehavior that arose in said batch process.
Years later I was talking to an ex-coworker who stayed at said company after I left. He reported that that original developer after I left had to make a fix to said batch process. While doing so he had fixed the performance bug and was an instant hero among a number of other people who had to sit through the same import repeatedly over time.
Yeah... it turns out that he had made the fix I'd originally asked for. And nobody realized that he had caused the problem originally AND refused to fix it.
That's one developer that I'm glad I will never have to work with again...
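For the curious, the shape of that bug: chopping characters off the front of a string forces a copy of everything that remains, turning a linear parse quadratic. A hedged reconstruction (not the actual code from the story):

```cpp
#include <string>
#include <vector>

// Quadratic: each erase(0, 1) copies the entire remainder of the string,
// once per character, so an n-character file costs O(n^2) work.
std::vector<std::string> split_slow(std::string text, char sep) {
    std::vector<std::string> fields;
    std::string current;
    while (!text.empty()) {
        char c = text[0];
        text.erase(0, 1);  // copies everything after position 0
        if (c == sep) { fields.push_back(current); current.clear(); }
        else          { current += c; }
    }
    fields.push_back(current);
    return fields;
}

// Linear: walk the string once; no character is ever copied twice.
std::vector<std::string> split_fast(const std::string& text, char sep) {
    std::vector<std::string> fields;
    std::string current;
    for (char c : text) {
        if (c == sep) { fields.push_back(current); current.clear(); }
        else          { current += c; }
    }
    fields.push_back(current);
    return fields;
}
```

Both produce identical output; only the first turns a small import into a 20-minute wait.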
This reminds me of a story I heard about a lead game programmer who made a habit of pre-allocating an unused megabyte at the beginning of a project. The last week before ship, when the game would crash out of memory often and everyone was freaking out and trying to squeeze every last byte out of their code, he'd comment the allocation nobody had noticed, free up a megabyte, and be the hero that made the game fit.
But in the version that I heard, the lead programmer didn't do it to be a hero. He did that because he knew that it is easier to not waste memory up front than it is to figure out where you can save it at the end. Therefore he made sure that memory became an obvious problem early on so that people were careful about it the rest of the way through development.
In other words, because he did that, the game hit memory targets it never could have hit had people waited until the last moment to squeeze everything in, only then discovering they were out of memory.
> The fly-by-wire flight software for the Saab Gripen (a lightweight fighter) went a step further... Control flow went forward only. Sometimes one piece of code had to leave a note for a later piece telling it what to do, but this worked out well for testing: all data was allocated statically, and monitoring those variables gave a clear picture of most everything the software was doing.... No bug has ever been found in the “released for flight” versions of that code.
When I started writing Arduino code, I independently re-invented this technique. I write my code with only fixed-size loops and no delay() calls (basically cooperative multitasking), making heavy use of finite-state machines to keep track of what should happen when.
This style of code takes some getting used to, but its advantages are enormous: I see other people using delays and busy-loops and then struggling to get interrupts working to keep their device responsive, whereas I have a clean linear set of logic that can be easily understood and is free of strange race conditions and the like, and is guaranteed to keep responding to events.
This style also makes it easy to literally eyeball performance: attach a bicolor LED (one that has e.g. red and green LEDs in the same package) and tell it to toggle colors once per loop. If the main loop is taking more than 8 milliseconds or so the LED will start to flicker visibly; if something pauses the LED will go solid and it's obvious the main loop is frozen.
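A minimal sketch of that cooperative, non-blocking style, in plain C++ with a simulated clock standing in for Arduino's millis() and a counter standing in for the pin write:

```cpp
#include <cstdint>

// One cooperative task: toggle an LED every period_ms without blocking.
// On real hardware `now` would come from millis() and the toggle would
// call digitalWrite(); here we just count toggles so it runs anywhere.
struct Blinker {
    uint32_t last_toggle = 0;
    uint32_t period_ms;
    bool led_on = false;
    int toggles = 0;

    explicit Blinker(uint32_t period) : period_ms(period) {}

    // Called once per pass through the main loop; returns immediately,
    // so other tasks never wait on this one.
    void step(uint32_t now) {
        if (now - last_toggle >= period_ms) {
            led_on = !led_on;
            last_toggle = now;
            ++toggles;
        }
    }
};

// Simulated main loop: total_ms of virtual time, 1 ms per iteration.
int run_loop(uint32_t total_ms, uint32_t period_ms) {
    Blinker blink(period_ms);
    for (uint32_t now = 0; now < total_ms; ++now)
        blink.step(now);  // other tasks would get a step() call here too
    return blink.toggles;
}
```

Because no task ever blocks, the main loop stays fast enough that the bicolor-LED trick above works: every pass through the loop is visible.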
(Edits for clarity and writing style.)
The point is (as you say) to make all loops & allocation as predictable as possible, so when you make them fixed size, you have to size them for the worst case. Ideally there is no worst case, and the amount of work you do per frame or cycle is constant. But for anything non-trivial, chances are fixed allocations & loops aren't always possible, even though they're extremely useful.
This does exactly what you say, in my experience: it makes performance eyeball-able, you can see where problems are. Code becomes so easy to reason about that debugging is easier and time is saved.
However, I did find one standalone project that illustrates the point quite elegantly; I've just written up some documentation and posted it on https://github.com/wolfgang42/iDropper-arduino
This code cheats a little bit in that it has some delay loops in it; this is because the timings are finicky and it was a lot easier to write this way. Notice, however, that the `idrop_loop()` function that drives everything still specifies a four-millisecond maximum runtime, far under anything that I've ever needed to worry about.
Let me know if you have any questions about this; I feel that some of the explanation I wrote isn't quite as clear as it should be, and because the code is so minimal (it's really more of a library than a full project) it doesn't illustrate the point quite as well as I'd like.
I like the advice of setting a breakpoint and then stepping through all the code for, say, a frame. If what actually happens is way off from how the code is structured, your team will have problems comprehending what the code does, which will make it very hard to work with.
A friend once sped up some code 90% by changing a `list.size() == 0` check to `list.empty()`.
yes, std::list::size() can be _linear_ in C++98.
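The reason it matters: C++98 allowed std::list::size() to walk every node, while empty() is required to be O(1) in every standard revision. The fix, in sketch form:

```cpp
#include <list>

// Before (C++98): size() may traverse the whole list just to
// compare the count against zero.
bool has_work_slow(const std::list<int>& queue) {
    return queue.size() != 0;
}

// After: empty() is constant time, and states the intent directly.
bool has_work_fast(const std::list<int>& queue) {
    return !queue.empty();
}
```

(Since C++11, std::list::size() must be O(1) as well, so this particular trap is historical, but empty() still reads better.)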
Your friend probably could've sped his code up even more by using a std::vector. Linked lists have very bad processor cache behavior (not allocated in contiguous memory); there are not a lot of things a list does better than a vector.
Only if you prepend a lot, or insert in the middle, might a list have some advantages over a vector, and even then there are better alternatives.
I did similar tests comparing linear scans through a std::vector and lookups in a std::map or std::unordered_map, with similar results: many thousands of items are required before dumb vectors get slow.
My default rule is to use a std::array if the size is fixed, or a std::vector if the size changes, until I hit 1,000 items. Then I look at the complexity of the operations I'm doing and choose which containers to benchmark. Then I throw a couple of microbenchmarks in the test suite and use whatever that tells me will be faster in my case. Sometimes I even go so far as to use ifdefs and typedefs to change the container for different builds.
In practice I use a lot of std::vectors and I sort them a lot. Sorted vectors are very fast in the general case. In the few workloads I have with large numbers of items, even with millions of items, I use std::sort and std::binary_search with std::vector because they can be much faster than other associative containers. You do need to use operator< intelligently and that means either already passing in an overload for the comparator function or making a simple container class that wraps these concepts the way that makes sense for your workload.
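A sketch of that sorted-vector pattern, wrapped in a small illustrative class (the names are mine, not from a real codebase):

```cpp
#include <algorithm>
#include <vector>

// Sort once up front, then answer membership queries in O(log n).
// For mostly-read workloads this tends to beat node-based associative
// containers because the data is contiguous and cache-friendly.
struct SortedSet {
    std::vector<int> items;

    explicit SortedSet(std::vector<int> v) : items(std::move(v)) {
        std::sort(items.begin(), items.end());
    }

    bool contains(int key) const {
        // Precondition: items is sorted (guaranteed by the constructor).
        return std::binary_search(items.begin(), items.end(), key);
    }
};
```

For non-trivial element types you'd pass a comparator (or wrap the key as described above) to both the sort and the search, so the two always agree on the ordering.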
I once had some code that I managed to speed up around 90%. It was returning a massive data structure that another function would then operate on. I sped it up by removing the data structure entirely, and instead adding a callback that would be called for each element of the data structure.
It's not always possible to do, and half the time it doesn't even help, but every now and then it's a good way to optimize.
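A sketch of that shape change, with illustrative names (the original data structure was surely more involved):

```cpp
#include <functional>

// Before: the producer would build and return a large container.
// After: it invokes a callback per element, so nothing is materialized.
void produce_squares(int count,
                     const std::function<void(long long)>& on_item) {
    for (int i = 0; i < count; ++i)
        on_item(static_cast<long long>(i) * i);
}

// The caller aggregates in place, with no intermediate allocation.
long long sum_of_squares(int count) {
    long long total = 0;
    produce_squares(count, [&](long long v) { total += v; });
    return total;
}
```

The win comes from never allocating, filling, and then walking a structure that existed only to ferry values between two functions.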
Synapse was generally considered to be inefficient and a huge memory hog, so they have been experimenting with various ways to rewrite it in Go, saying at the end of last year:
"Synapse is in a relatively stable state currently ... we’re starting to hit some fundamental limitations of the architecture: ... python’s single-threadedness and memory inefficiency; ...; the fact the app papers over SQL problems by caching everything in RAM (resulting in synapse’s high RAM requirements); ..."
But in January it was discovered, during the investigation of another problem, that the vast majority of the memory usage was due to a bug:
"It’s also revealed the root cause of why Synapse’s RAM usage is quite so bad – it turns out that it actually idles at around 200MB with default caching, but there’s a particular codepath which causes it to spike temporarily by 1GB or so – and that RAM is then not released back to the OS."
I wonder if they would still have started the Go rewrite if it hadn't been for that bug. (It seems the new implementation has a better architecture and is substantially faster, so I guess it's still a good idea, but maybe not as urgent as before?)
One thing I keep learning again and again: You never know what the performance problem is until you've measured it.
His particular example was with the standard C library function strcat where each new invocation traverses the enlarged string over and over from the beginning, in order to find the terminating NUL character and append the new part.
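That strcat pattern is sometimes called the "Shlemiel the painter" problem; the sketch of the fix is simply to remember where the string currently ends instead of rescanning for it:

```cpp
#include <cstring>

// Quadratic: strcat rescans dst from the start on every call,
// hunting for the terminating NUL before it can append.
void join_slow(char* dst, const char* const* parts, int n) {
    dst[0] = '\0';
    for (int i = 0; i < n; ++i)
        strcat(dst, parts[i]);
}

// Linear: keep a pointer to the current end and append there directly.
void join_fast(char* dst, const char* const* parts, int n) {
    char* end = dst;
    for (int i = 0; i < n; ++i) {
        size_t len = strlen(parts[i]);
        memcpy(end, parts[i], len);
        end += len;
    }
    *end = '\0';
}
```

(Both assume the caller has sized dst to hold the result; the point is only the rescanning, not buffer management.)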
After identifying what is to blame (network, db nodes, processing nodes, application logic, cache invalidation), we can talk about fixing.
For fixing of course it would be great if you could extract a meta model of the application and another of the hardware. Changes to those models and some nice prediction of impact would be the ideal end goal. Or, if we are wishing anyway, some nice DSE on those models to optimize would be even better.
The stack is something grown the past 15 or so years (medical business). The existing options are not mature enough yet for us to be used, or not available yet on our platforms.
The most problematic performance issues I've come across were usually bad/premature optimizations that were not (correctly) validated against a simpler implementation as a performance baseline. Things like parallelism (multi-threading, web workers) or caching can absolutely tank performance if not done correctly. Plus they usually tend to make stuff more complex and bug-prone.
I'm not sure that's true. It is of course true that adding levels of abstraction often degrades performance, but I don't think it's necessary.
In fact, programming languages themselves are abstractions. It's pretty well agreed at this point that a good C compiler will generate faster code than someone writing assembly. It's of course possible to write code that's at least as fast in assembly, but in practice, nobody would do it, because the program would be unmaintainable.
You can make abstractions fast, and one of the benefits of doing so is that you can make your abstraction be the easy way of doing things. That means, by picking the easiest solutions, the people on your team are also using an efficient solution.
We must be discussing with different people then. Have you ever looked at what a compiler spits out? My rule is that a compiler is as good as a half-decent assembly coder. I won't even go into the limitations of compilers.
> It's of course possible to write code that's at least as fast in assembly, but in practice, nobody would do it, because the program would be unmaintainable.
You don't write the whole program in assembly. But you can write it with it still being maintainable if you use just a little restraint and forward planning. See the Fresh IDE and the libraries that go with it.
If you want to write performant C, you should focus on writing simple code. Being smart, especially with needless abstractions, will more often than not result in slower programs. The "hello i am a compiler" crap was never true and won't be in the near future.
I've honestly never heard that before. If you'd said "a good C compiler will generate code faster than someone writing assembly," then I'd agree... ;)
Instruction scheduling is not a problem on modern CPUs. Not that it's hard either way; just reorder things around memory accesses and heavy instructions like division. Agner Fog made a nice table of instruction latencies and throughputs.
Instruction encoding length... do you use 16-bit integers or do you just int everything? The compiler has to do what you tell it to do. Sure, there are plenty of tricks around that, but those are not the ones that give the compiler a remarkable advantage over humans. Tricks like using bitwise operations instead of multiplication, and shortening equations, are the advantage that the compiler has. But even that is not that big of a deal. (Again: do you use the restrict keyword? Or static/define for constants? Do you know that C treats every "compilation unit" (.c file) as a separate program?)
No idea. But I've done some assembly myself (as a console game programmer) and generally speaking, you don't have to think about any of the issues you raised to get faster code than what the compiler creates. There are some corner cases occasionally, but not very often. Does any of this mean I tend to choose assembly? No way.
> the compiler does these really quickly for you
It sounds like you generally agree with what I said? ;)
That said, there are reasons that compilers often make slower code than humans, reasons that may always exist. The primary reason to use a compiler isn't to get faster code, it's to use a programming language abstraction and get "fast enough" code. The abstractions are (generally speaking) what add layers that are hard for a compiler to optimize against a person who's not using abstractions. There's a long list of commonly used features that compilers have a hard time optimizing: virtual functions, the threat of aliasing, floating point pipelining, etc. Compilers are required to make different choices than humans writing assembly would, because of the abstractions that we choose to use in higher level languages.
Lots of people agree that it's not hard to make assembly faster than an optimizing C compiler.
The generated code? "mov ax, 255; ret"
That was pretty good optimization, seeing completely through the abstraction and flattening it to the simplest possible implementation. I was impressed.
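That effect is easy to reproduce today; a hedged example (not the original code from that story) that mainstream compilers at -O2 typically fold down to a single constant-return:

```cpp
// With optimization enabled, GCC and Clang typically evaluate this
// whole loop at compile time, so fifty_five() compiles down to the
// moral equivalent of "mov eax, 55; ret".
unsigned sum_to(unsigned n) {
    unsigned s = 0;
    for (unsigned i = 1; i <= n; ++i)
        s += i;
    return s;
}

unsigned fifty_five() { return sum_to(10); }
```

Pasting something like this into a compiler explorer makes the flattening visible directly in the emitted assembly.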
Anyway, there are lots of further abstractions these days, harder to see through. I imagine a clever human, with domain knowledge, can write a loop to do what a high-level library used to do.
Inlining code, especially with multiple passes, is something humans don't do. I still modularize in assembly and write functions, factor code used in multiple places into a single spot. The compiler is free to expand the same function inline a thousand places, and then use the expanded code to collapse to fewer instructions.
The flattening can definitely be cool to see. So in your example, what would you have done if you'd written the assembly yourself? Do you feel like the compiler did something you wouldn't have?
But if you're not aware of what the compiler is emitting, it's often not hard to pull an order of magnitude improvement out of the code.
- Most performance-critical code is regular enough to stay far away from the NP-hard corners of register allocation. In the rare instances when I need to do careful hand-allocation to get an inner loop to fit perfectly, the compiler usually gives up and spills several variables.
- Performance-critical code is rarely latency bound, and modern processors are wildly out-of-order, so scheduling based on latency is usually a waste of time (and sometimes a pessimization). Minimizing the critical path length of loop-carried dependencies and balancing port utilization are far more important factors in most code sequences.
When code is latency-bound, it's often due to bad design decisions further up the stack; refactoring those to shift the concern to throughput is vastly more profitable than anything you can do to reduce latency locally, and is the sort of optimization that a compiler simply can't do (and usually can't even tell you you should do). I can tell the guy calling my BLAS library that he should try to batch up many vectors and call GEMM instead of GEMV. The compiler usually doesn't do that, and no-matter how hard it works at optimizing GEMV, it's going to be bound by memory access.
- Encoding length is a factor for whole-program optimization (reducing i-cache pressure), but critical sections often fit into the decoded µop cache / loop buffer / inner cache / etc, making this a non-issue. Sometimes you want to choose a particular encoding to optimize for e.g. fetch group alignment or to separate branches to aid the predictor, but there you're usually not trying to minimize length but to get alignment by choosing encodings rather than inserting no-ops (and compilers generally don't do this optimization, because they don't have alignment/instruction-size information until really late in the process).
- Avoiding pipeline hazards and getting inlining right are both worthwhile, but really, really straightforward for the most part. (Inlining is hard to get right if you're trying to avoid wasted work in the name of compiling as fast as you can, but humans don't have that restriction.)
The compiler works a lot faster than I do, but I'm a lot better at balancing these and other considerations and methodically optimizing a non-trivial code sequence until it's nearly optimal. I might take a week to optimize the same function that the compiler optimizes in a millisecond, but my code is faster. For most code, the compiler version is fine. For heavily-used system library functions, you definitely want my version instead.
How can we improve compilers so that they're as good as I am? Compilers today mainly work in terms of local transforms on a computation graph, which necessarily cannot capture long-range optimizations that are available to me; most expert assembly programmers I know work mainly in terms of a global resource utilization minimization approach. There's no real reason why we can't make compilers that work that way, but it will almost surely make them significantly slower, so it would be more of a super-optimizer mode that you would only want to run once.
I can also make inferences about what the programmer was "really" trying to do, and go down the hall to ask him. The compiler doesn't (usually) do that; there's a lot of opportunity for compilers to say "hey, I noticed that if x always has this common property, I can make your code much faster -- is that a valid assumption here?"
Compilers aren't bad, but there's a lot of room for improvement, still.
Essentially you also lose the advantage of runtime optimization. Compare a plain Java compiler with a Java JIT compiler.
It's a good fit for low-power single-issue architectures and DSPs, because ... well, there's only trivial scheduling to be done because the machine is so dead simple. It's a good fit for GPUs because they are primarily throughput oriented, and can make scheduling issues "go away" by running lots of threads / warps / whatever marketing wants to call them to hide latency.
Old Intel Atoms also used to lack OoO execution.
The second biggest problem is that cache residency is often not statically predictable, which means that you generally have no idea in what order the various memory dependencies will actually be satisfied, which makes all your effort on scheduling approximately useless.
Those older Atoms were pretty universally terrible, by the way.
Also, link-time and whole-program optimization can remedy some performance issues across modules/libraries. [edit: not algorithmic problems, of course, LTO isn't going to rewrite your algorithms for you.]
Neither typically provides the highest performance. (e.g. Boost uBLAS vs. Eigen)
Good architecture with well-defined levels of abstractions makes it easier to change algorithms, to replace implementation of algorithms with different ones, and so on. Even though abstractions are not always free performance-wise, often they do help fixing performance problems, albeit not directly.
test.config.setSLAMaxTime(2000); // set time in MS
assertSLAMet(); // add this to any page
If anyone is interested, I'll be announcing it here when we open source it:
Isn't this just... a single function? Like, how the heck do you build a startup around that. We do this as part of our Python selenium tests, I didn't realize it wasn't built in to be honest.
We do real-browser load testing in the cloud using the same scripts folks use for integration testing. Because they can also run locally, you can run them alongside your integration tests with each commit, and you always know you have working scripts. Scripts are easy to code and maintain because they can use page objects like functional tests.
This means you can have versioned load test/monitoring scripts in your project that always work with a particular sha. When you deploy, you can start transactional monitoring immediately. It also means load testing is not a separate stage in your development lifecycle.
Opportunities to uncover performance that could be improved and impose requirements against it are out there, lurking unseen. Only problem is you have to convince people to care on a case-by-case basis.
I haven't worked on it much recently though because the problems it found weren't that significant. But I like the idea of runtime analysis for finding issues, especially in a dynamically-typed language.
I love this sort of approach to performance. If you know the hardware, the software, and the application, you can predict how fast it should go. That's surprisingly rare, in my experience, sadly. About the time I wrote lmbench I was sitting in a lot of meetings at Sun where people were saying "we should do this" and I could do a mental estimate of how fast it would go (I memorized most of the stuff that lmbench measured so I know how many packets/sec we could do, how many iops, how much memory bandwidth we had, etc). You'd be amazed at the number of times "architects" were proposing to build something that couldn't possibly work.
It's pretty systems-y and not that common in programmers, what with today's frameworks and all. But there is still some value in being able to predict how fast something should run. I love it when I see people doing that with the needed knowledge to be accurate, very fun to watch.
With enough alternate tasks, waits are absorbed. Oddly, overall throughput can be worse with sporadic tiny delays, even when each individual task is technically fast.
You don't just want fast operations, you want programs that can be smart about gathering unavoidable delays into chunks. This can even pay dividends when consuming resources, as the gathered steps may use less (e.g. laptop wakes dormant hardware just once to perform several grouped tasks, instead of repeatedly over a short time frame).
The painful thing is that the expensive check downstream winds up doing everything, but the code still "works" because the expensive check is the only necessary part (well, as long as the pre-filter isn't totally hosed and isn't producing false negatives).
- CPU usage (complexity, cache misses...)
- Memory usage (leaked resources like memory, connections and handles)
- Excessive I/O (networking, files...)
- Concurrency (contention, starvation, deadlocks, livelocks...)
There's no one size fits all profiler. I find load testing in combination with profiling very useful to get an idea of how a system performs.
Usually I keep injecting load until the system fails then find what failed.
There are other approaches, like automated performance testing (writing tests with assertions on performance). e.g: https://scalameter.github.io
I spent about a year researching this and put together my findings into this (too short) 20 minute tech talk: https://youtu.be/BEqH-oZ4UXI hopefully others will find it useful too.
Or just never use JS ;) but it is possible to make it fast.
This is important because many performance conditions don't reveal themselves all of the time: for example, it's very common that an issue might only be a problem for your largest customers. The context is really important.
Scout has a production-safe profiler for Ruby apps that builds on the wonderful StackProf gem that does this:
Edit: Back up, let's see how long it lasts now :)
Edit: back up after further 10min. Seems like it moves in waves.
What is this predicted off-by-one-frame-of-latency input sampling?