My experience writing large-scale search engine, NLP, and computer vision services, generally as backend web services with tight performance constraints imposed by the client apps that consume them, is fundamentally the opposite of what you say and what you quoted from the article.
It is virtually never the case that profiling reveals a uniformly slow mess of code, and in many cases this doesn’t even make sense as a possibility, because you write modular performance tests and profiling tools the same way you write modular unit tests. You would never fire up the profiler and naively profile a whole backend service all at once, apart from maybe gathering extremely coarse latency statistics that only matter in terms of the stakeholder constraints you have to meet. When you’re not just gathering whole-service stats for someone else, you always profile in a more granular, experiment-driven way.
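To make the granular, experiment-driven part concrete, here is a minimal sketch in Python of profiling one component in isolation; the `preprocess_images` function is a hypothetical stand-in for a real module, not anything from the services described above.

```python
# Minimal sketch of module-level profiling: profile one component in
# isolation rather than the whole service end to end.
import cProfile
import io
import pstats


def preprocess_images(images):
    # Placeholder for the component under test.
    return [img * 2 for img in images]


def profile_component(func, *args):
    """Profile a single callable and print its hottest functions."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf).sort_stats("cumulative")
    stats.print_stats(10)  # top 10 entries only
    print(buf.getvalue())
    return result


if __name__ == "__main__":
    fake_batch = [float(i) for i in range(100_000)]
    profile_component(preprocess_images, fake_batch)
```

The same harness can wrap any individual module, so each experiment answers one narrow question instead of producing a service-wide blur.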
> “Developers should have a solid grasp of computer hardware, compilers, interpreters, operating systems, and architecture of high performance software. Obliviousness to these is the source of performance-hostile program designs. If you know these things, you will write better performing code without consciously thinking about it.”
While it’s good to know more about more things generally, I think the specific claim that any of this will help you design better code in the first place is totally and emphatically wrong.
Instead it leads you down wrong tracks of wasted effort, where someone’s “experience and intuition” about necessary optimizations or necessary architectural trade-offs ends up being irrelevant to the eventual end goal, the changing requirements from the sales team, and so on.
This happens so frequently and clashes so miserably with a basic YAGNI pragmatism that it really is a good heuristic to just say that early focus on optimization characteristics is always premature.
In the most optimization-critical applications I’ve worked on, getting primitive and poorly performing prototypes up and running for stakeholders has always been the most critical part to completing the project within the necessary performance constraints. Treating it like an iterative, empirical problem to solve with profiling is absolutely a sacred virtue for pragmatically solving the problem, and carries far, far less risk than precommitting to choices made on the speculative basis of performance characteristics borne out of someone else’s “experience” with performance-intensive concepts or techniques. Such experience really only matters later on, long after you’ve gathered evidence from profiling.
It sounds like the services you are describing basically take the form of "evaluate a pure function on an input". I am not going to argue with your stance there. I am more referring to programs that run interactively with some large mutable state (native GUI apps for me). The choices of how that state is structured has a huge impact on the performance ceiling of the app, and is often difficult to change after decisions have been made. Sibling example of a linked-list of polymorphic classes is the kind of thing I'm talking about. Once you have 100,000 lines of code that all operate on that linked list, you are stuck with it.
The services I’m talking about are not as you describe. They’re usually very stateful, often involving a complex queue system that collects information about the user’s request and processes it in ways that alter that user’s internal representation for recommender and collaborative filtering systems. It’s not an RPC to a pure function; it’s a very large-scale, multi-service backend orchestrating a big variety of different machine learning services.
> “Once you have 100,000 lines of code that all operate on that linked list, you are stuck with it.”
No, I think this really is not true. As long as the new “performant” redesigned list you want to swap in can offer the same API, this is a relatively easy refactoring problem. I’ve actually worked on problems like this, where some deeply embedded and pivotal piece of code needs to be refactored. It’s a known-quantity kind of problem: unpleasant, sure, but very straightforward.
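A toy sketch of what that swap looks like (Python, with invented names; nothing here is the sibling commenter’s actual design): as long as callers go through the same small API, the linked list can be replaced with contiguous storage without touching them.

```python
# Callers depend on a small interface, so the backing structure can be
# swapped (linked list -> array) without changing the code that uses it.
from typing import Iterator, Protocol


class Shape(Protocol):
    def area(self) -> float: ...


class ShapeStore:
    """Original linked-list-backed store (simplified)."""

    class _Node:
        def __init__(self, value, nxt=None):
            self.value, self.next = value, nxt

    def __init__(self):
        self._head = None

    def add(self, shape: Shape) -> None:
        self._head = self._Node(shape, self._head)

    def __iter__(self) -> Iterator[Shape]:
        node = self._head
        while node is not None:
            yield node.value
            node = node.next


class ArrayShapeStore:
    """Drop-in replacement with contiguous storage; same add/__iter__ API."""

    def __init__(self):
        self._items: list[Shape] = []

    def add(self, shape: Shape) -> None:
        self._items.append(shape)

    def __iter__(self) -> Iterator[Shape]:
        return iter(self._items)


def total_area(store) -> float:
    # Caller code is unchanged regardless of which store is used.
    return sum(s.area() for s in store)
```

The 100,000 lines of caller code only hurt you if they reach into the node structure directly; if they go through the interface, the swap is mechanical.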
You seem to discount the reverse version of this problem, which I’ve found to be far more common and far nastier to refactor. In the reverse problem, instead of being “stuck” with a certain linked list, you end up stuck with some mangled and indecipherable set of “critical sections” of code where someone did some unholy low-level performance optimization, and nobody is allowed to change it. You end up architecting huge chunks of the system around the required use of a particular data flow, or a particular compiled extension module with a certain extra hacked data structure, and this stuff accumulates like cruft over time and gets deeply wedded into makefiles, macros, deployment scripts, etc., all in the name of optimization.
Later on, when requirements change, or when the overall system faces new circumstances that might allow for trading away some performance in favor of a simpler architecture or the use of a new third-party tool, you just can’t, because the whole thing is a house of cards predicated on deep-seated assumptions about these kludged-together performance hacks. This is usually the death knell of that program and the point at which people reluctantly start looking into just rewriting it, this time without overcommitting to optimization.
(One example where this was really terrible was some in-house optimizations for sparse matrices for text processing. Virtually every time someone wanted to extend those data structures and helper functions to match new features of comparable third-party tools, they would hit insane issues with unexpected performance problems, all boiling down to a chronic over-reliance on internal optimizations. It made the whole thing extremely inflexible. Finally, when the team decided to actually just route all our processing through the third party library anyway, it was then a huge chore to figure out what hacks could be stripped away and which ones were still needed. That particular lack of modularity truly was caused specifically by a suboptimal overcommitment to performance optimization.)
“Overcommitting” is usually a bad thing in software, whether it’s overcommitting to prioritize performance or overcommitting to prioritize convenience.
The difference is that trying to optimize performance ahead of time often leads to wasted effort: fast-running things that don’t solve a problem anyone cares about anymore, or that lack the critical new feature whose addition totally breaks the performance constraints.
Optimizing to make it easy to adapt and hack these things in, to have a very extensible and malleable implementation, is almost always the better bet, because the business people will be happy you can accommodate rapid changes, and they will be more willing to negotiate compromises or delayed delivery dates if that flexibility runs into a performance problem that truly must be solved.
Your points are good. I did not intend to suggest that software developers should make the kind of decisions that lead to "mangled and indecipherable set of 'critical sections'". More that developers who understand their tools at a lower level are likely to choose architectures with better computational efficiency. Simple things like acyclic ownership, cache locality, knowing when to use value vs. reference semantics, treating aggregate data as aggregate instead of independent scalars, etc.
Often these choices do not present a serious dichotomy between readability and performance. You just need to make the right choice. Developers who understand their tools better are more likely to make the right choice.
My main argument is that this kind of background knowledge should be more widely taught. Performance is often presented as a tradeoff where you must sacrifice readability/modifiability to get it. I think this is not true. Choosing good data structures and program structure is not mutually exclusive with having "a very extensible and malleable implementation".
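As a concrete, toy illustration of one of the choices mentioned above, treating aggregate data as aggregate instead of as independent scalars, here is a sketch in Python/NumPy; the names are invented and NumPy is simply assumed to be available. The point is that the better-performing choice costs nothing in readability.

```python
# "Aggregate data as aggregate": process a whole array at once instead of
# looping over scalar fields of individual objects.
import numpy as np


# Scalar-at-a-time: a list of point objects, processed one field at a time.
class Point:
    def __init__(self, x: float, y: float):
        self.x, self.y = x, y


def centroid_scalar(points: list) -> tuple:
    sx = sum(p.x for p in points)
    sy = sum(p.y for p in points)
    return sx / len(points), sy / len(points)


# Aggregate: one (n, 2) array, processed as a whole.
def centroid_aggregate(points: np.ndarray) -> np.ndarray:
    return points.mean(axis=0)


if __name__ == "__main__":
    pts = np.random.rand(1_000_000, 2)
    print(centroid_aggregate(pts))                          # vectorized, cache-friendly
    print(centroid_scalar([Point(x, y) for x, y in pts[:1000]]))
```

Both versions read fine; the aggregate one is simply the layout the hardware wants.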
Why do you have 100kloc operating directly on a single shared mutable structure in the first place? That sounds like an architectural choice that's going to make any sort of change far more difficult than it needs to be.
Avoiding abstraction is another form of premature optimisation.
Ironic that you are displaying exactly the kind of critique of the design that you should! The lack of thoughts like this is the issue.
Unfortunately this polymorphic list sort of design is far too common.
Much of the code that cares about this architecture is going to be in the implementation of these classes, not in some single and easily fixable place.
How many of us actually have any knowledge of how our compiler works? I only have a rudimentary knowledge of the Java AOT and JIT compilers. It's enough that with some work I can read `PrintAssembly` output but really, I'm not nearly that familiar with the compiler.
I think you're right in that attempting to write code that specifically matches all these things is likely to prematurely optimize. I develop on Linux on an Intel consumer desktop processor, and OS X on an Intel consumer laptop processor, and I deploy to a containerized instance running on a virtualized Intel processor running on an Intel server processor. This lets me get to market quite fast. And I should be trying to figure out the wide instructions on the Intel server and if they're a good idea? I didn't even know about pcmp{e,i}str{i,m} until a couple of years ago and I have used it precisely zero times.
As it stands, 400k rps and logging every request, for an ad syncing platform, and it's not breaking the bank. Spending time to learn the intricacies of the JVM is not likely to have yielded improvements.
"Except, uh, a lot of people have applications whose profiles are mostly flat, because they’ve spent a lot of time optimizing them.”
— This view is obsolete.
Flat profiles are dying.
Already dead for most programs.
Larger and larger fraction
of code runs freezingly cold,
while hot spots run hotter.
Underlying phenomena:
Optimization tends to converge.
Data volume tends to diverge.
I think the issue here is that in the large-scale systems you cite, performance is usually dominated by the relatively high cost of making a network hop. Network latency and the number of hops dominate runtime to such an extent that, unless you are writing code that is >= O(n^2), your own code barely registers. Also, many problems in these domains are embarrassingly parallel, and you can effectively summon a large number of machines to hold state for you in memory. Where it does get interesting is situations like games or garbage collectors where you want to minimise latency. On such problems you usually can't bolt on performance after the fact.
No, this is not at all what I’m talking about. I’m specifically talking about cases with tight latency requirements after the request has been received. Like receiving a POST request containing many images, and invoking complex image preprocessing with a mix of e.g. opencv and homemade algorithms and then feeding into a deep convolutional neural network. Latencies of 2-3 seconds per request might be the target, and a naive unoptimized implementation of just the backend code (totally unrelated to any network parts) might result in something 10x too slow.
In that case, it’s absolutely a best practice to start out with that 10x-too-slow-but-easy-to-implement-for-stakeholders version, and only harden or tighten your homemade computer vision code or use of opencv or some neural network tool after profiling.
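For what it’s worth, the profiling step on that kind of naive pipeline can be as simple as per-stage timing. The rough sketch below (Python, with placeholder decode/preprocess/inference stages that are not the real code) is enough to tell you which stage deserves the hardening effort.

```python
# Per-stage timing of a naive request handler, to find where the 10x goes.
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    start = time.perf_counter()
    yield
    timings[label] = timings.get(label, 0.0) + time.perf_counter() - start


def handle_request(images, timings):
    with timed("decode", timings):
        decoded = [bytes(img) for img in images]        # placeholder stage
    with timed("preprocess", timings):
        preprocessed = [d[:1024] for d in decoded]      # placeholder stage
    with timed("inference", timings):
        results = [len(p) for p in preprocessed]        # placeholder stage
    return results


if __name__ == "__main__":
    timings = {}
    fake_images = [bytearray(1_000_000) for _ in range(8)]
    handle_request(fake_images, timings)
    for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{stage:12s} {seconds * 1000:8.1f} ms")
```

Only once the numbers single out a stage do you spend the effort tightening that particular piece of opencv or model code.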
> “Where it does get interesting is situations like games / garbage collectors where you want to minimise latency. On such problems you usually cant bolt on performance after the fact.”
No, that’s exactly the sort of case I’m talking about. It’s not only possible to refactor for performance after the fact, but dramatically easier to do so, less risky in terms of optimizing for the wrong things, and less prone to re-work.
Many were there from the start, and then some changed a few months later, and then the whole thing changed radically a year after that... which is part of the point. Business-case software is always a moving target.
In one particular application, the required backend calculations were so difficult that we built the workflow around a queue system, with roughly 3 seconds as the upper limit on request time before user experience suffered (it involved ajax calls from a complex in-browser image processing studio).
The type of neural network being invoked on the images has a state-of-the-art runtime of maybe 1 second per image, and that is after pulling out all the complex tricks, like truncating the precision of the model weights and using a backend cluster of GPUs to serve batched requests from the queue service, which adds a huge amount of deployment logic, runtime cost, testing complexity, fault tolerance issues, etc.
The naive way, using CPUs for everything and not bothering with any other optimizations, had a latency of about 10s per request. Definitely too slow for the final deliverable, but it let us build demos and tools through which stakeholders could guide us on what needed to change, which performance issues actually mattered, usability, getting the endpoint APIs stable, etc.
It would have slowed us down far too much to be practical if we had tried to keep our whole deployment pipeline abstract enough to just flip a switch and run on GPUs, months before that performance gain was necessary and before we had profiled the actual speedup of GPUs for our specific use cases and accounted for the extra batch-processing overhead.
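To give a flavor of where that batching overhead comes from, here is a rough sketch of a micro-batching worker in Python; all the names (MAX_BATCH, run_model_on_batch, the request dict layout) are invented for illustration, and the real system was considerably more involved.

```python
# A worker drains up to MAX_BATCH requests (or waits at most MAX_WAIT_S)
# before invoking the model once per batch, then routes results back.
import queue
import threading
import time

MAX_BATCH = 16
MAX_WAIT_S = 0.05


def run_model_on_batch(images):
    return [len(img) for img in images]  # placeholder for GPU inference


def batch_worker(requests: "queue.Queue", stop: threading.Event):
    while not stop.is_set():
        batch = []
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(requests.get(timeout=timeout))
            except queue.Empty:
                break
        if batch:
            images = [req["image"] for req in batch]
            results = run_model_on_batch(images)
            for req, res in zip(batch, results):
                req["reply"].put(res)  # hand the result back to the caller


if __name__ == "__main__":
    requests = queue.Queue()
    stop = threading.Event()
    threading.Thread(target=batch_worker, args=(requests, stop), daemon=True).start()

    reply = queue.Queue()
    requests.put({"image": b"\x00" * 1024, "reply": reply})
    print(reply.get(timeout=1.0))
    stop.set()
```

Every piece of this (deadlines, batch sizing, result routing, failure handling) is deployment and testing surface you take on only when the measured speedup justifies it.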
And this was all just one project. At the same time, another interconnected project on the text processing side had a latency requirement of ~75 milliseconds per request. And it was the same deal. Make a prototype essentially ignorant of performance just to get a tight feedback loop with stakeholders, then use very targeted, case-by-case profiling, and slowly iterate to the performance goal later.
That is sensible advice, but sometimes you cannot slowly iterate and getting to your performance target from a simple prototype realistically requires a step change in your architecture. Often you know enough of your problem that you can determine analytically that certain approaches will never deliver sufficient performance. In that case you might want a performance prototype after you created a functional prototype, then iterate by adding functionality while checking for performance regressions.
There must be some misunderstanding here, because what you're saying in this post and others just doesn't add up...
Specifically you've said that:
* it's wrong to think that knowing about HW, compilers/interpreters, OSes and high-perf architecture will in fact help you "design better (performing?) code in the first place".
* changing goals from the sales team have a big impact on performance optimisations, and this happens frequently
* in your experience taking a poorly performing prototype and optimising it iteratively was a successful strategy. Poorly performing, as an example, might be 10x too slow.
* "It’s not only possibly to refactor for performance after the fact, but dramatically easier to do so, less risky in terms of optimizing for the wrong things, and less prone to re-work."
My experience doing C++, Java, and JS, mostly on embedded and mobile, does not match your statements above. I'm not an optimisation expert, and in fact this is not even my main work area, but I've seen plenty of catastrophic decisions which hobbled the performance of systems and applications with no hope of reaching a fluid processing workflow short of an effort-intensive rewrite. Furthermore, in those situations the development team was significantly slowed down by the performance problems.
In one case an architectural decision was made to use a specific persistence solution in a centralised way for the entire platform, and this became a massive bottleneck. Good luck optimising that iteratively, sequentially or any other way.
Languages can have a huge impact on performance: in one project an inappropriate language was chosen; non-native languages usually trade memory usage for speed, and this caused problems with total memory consumption that were never really fixed. I would argue that memory is a typical constraint in embedded, along with disk I/O, GPU usage and everything else; not even close to all performance problems can be reduced to CPU usage (or to using SQL properly, as someone else unhelpfully quipped).
How many LoC do your projects typically have and what kind of languages and technologies do you use? I'd be interested to hear if the heavy lifting is done by library code or your own code and if you're able to easily scale horizontally to tackle perf problems.
And... how do changing sales goals every two weeks affect performance? Is this some stealth start-up situation? :)
The performance-antagonistic design they’re talking about can’t benefit from what you’re talking about. Worse, it’s usually antagonistic to other things, like well-contained modifications.
For instance, look at the Big Ball of Mud design. When you are looking at a Big Ball of Mud, these per-module performance tests are difficult or impossible to maintain. It’s quite common to see a flame chart with high entropy: nothing “stands out” after the second or third round of tweaking. Yes, there are ways to proceed at this point, but they don’t appear in the literature.
I realized in responding to someone else that I typically employ a strategy pretty similar to yours. That is, going module by module, decoupling, fixing one compartment at a time, and leaving it better tested (including time constraints) for the next go round and to reduce performance regressions. On a smallish team this works amazingly well.
Haven’t had as much luck with larger teams. Too many chefs and more churn to deal with.
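For the "better tested (including time constraints)" part, the kind of check I mean is nothing fancier than a per-module latency budget that runs with the rest of the test suite; the sketch below is illustrative Python with invented names and an arbitrary budget, not code from any of the projects discussed.

```python
# A repeatable latency check against an explicit budget, kept next to the
# unit tests so per-module performance regressions show up immediately.
import statistics
import time

LATENCY_BUDGET_S = 0.075  # hypothetical ~75 ms budget for this module


def handle_text_request(text: str) -> int:
    return sum(len(tok) for tok in text.split())  # placeholder workload


def p95_latency(func, arg, runs: int = 200) -> float:
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func(arg)
        samples.append(time.perf_counter() - start)
    return statistics.quantiles(samples, n=20)[18]  # 95th percentile


def test_latency_budget():
    latency = p95_latency(handle_text_request, "some representative text " * 500)
    assert latency < LATENCY_BUDGET_S, f"p95 {latency * 1000:.1f} ms over budget"


if __name__ == "__main__":
    test_latency_budget()
    print("within budget")
```

Each compartment you fix leaves one of these behind, which is what keeps the next round of changes from quietly undoing the work.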