
Performance bugs – the dark matter of programming bugs - Ono-Sendai
https://www.forwardscattering.org/post/49
======
dahart
> Performance bugs are out there, lurking and unseen

Is it a bug if it's unseen? Code that runs slower than it could in theory is
literally _everywhere_. But is it worth paying the cost of the time to fix it,
if you aren't experiencing any delays?

I love the way Michael Abrash talks about optimizing. He emphasizes and re-
emphasizes that optimization is for people not computers, and that if you
aren't experiencing a delay, you're done.

"In other words, high-performance code should ideally run so fast that any
further improvement in the code would be pointless.

"Notice that the above definition most emphatically does not say anything
about making the software as fast as possible. It also does not say anything
about using assembly language, or an optimizing compiler, or, for that matter,
a compiler at all. It also doesn’t say anything about how the code was
designed and written. What it does say is that high-performance code shouldn’t
get in the user’s way—and that’s all."

[http://www.jagregory.com/abrash-black-book/#the-human-
elemen...](http://www.jagregory.com/abrash-black-book/#the-human-element-of-
code-optimization)

~~~
shanemhansen
User focused metrics:

1. It's a user-facing bug if I have to purchase a bigger laptop because
someone decided to get the smallest element of a list by typing
list.sort()[0].

2. It's a user-facing bug if the use of my app heats up their phone, drains
the battery, and leaves someone stranded without a way to get safely home on
a Friday night.

3. It's a business-facing bug if my AWS bill is in the hundreds of thousands
per month when I could be paying a mere tens of thousands. (This is more
common than you'd think.)

4. It's a user-facing bug if a sudden increase in traffic (due to my product
being featured somewhere) takes down my services and results in lost sales and
brand damage.
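
For point 1, the fix is a one-liner. Note that in real Python `list.sort()`
returns `None`, so the comment's snippet reads as pseudocode; `sorted(data)[0]`
is the runnable equivalent of the wasteful pattern:

```python
# Getting the smallest element of a list: wasteful vs. cheap.
data = [42, 7, 19, 3, 88]

# The wasteful pattern: O(n log n), and it copies the whole list.
smallest_slow = sorted(data)[0]

# The cheap alternative: a single O(n) pass, no copy.
smallest_fast = min(data)

assert smallest_slow == smallest_fast == 3
```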

I understand your point: it's not necessarily a problem if a backend feed is
taking 45s instead of 30s, or if the text input subroutine for an interactive
app allocates.

As far as I can tell as a software developer and daily computer user, people
aren't erring on the side of making their software too fast.

~~~
dahart
100% agree with all of these, and you did what the author didn't and added
criteria that affect people. If these happen, then make the code faster!

Abrash is a well known games/graphics programmer, and part of his point is
that if your game is already running at 60fps, there's no point in replacing
list.sort()[0] with something else. No milliseconds saved will help anyone in
that case. (Though yes, as many pointed out, power consumption may be an
important metric).

Personally, I'd say if your backend feed is taking 45s and could easily take
30s, then you should fix it. :)

> people aren't erring on the side of making their software too fast

:) 100% agree... too fast isn't a problem we have. But, over-engineering is an
exceedingly widespread problem, and that's where I'm coming from. If there's a
user-focused metric that shows a problem, yes fix it. If there isn't a user-
focused metric, then think twice. The article didn't advocate a user-focused
metric as far as I could tell.

~~~
sqeaky
> if your game is already running at 60fps, there's no point in replacing
> list.sort()[0]

If the game runs on a device with a battery then reducing that load is a
reason. You already implied that though.

I suspect hard-to-quantify improvements in quality are not what you are trying
to rail against. If the software is going to be used widely or for a long
duration, hard-to-quantify improvements can easily save millions in the long
run.

There are over-engineering efforts where people try to shoehorn in bad design
strategies, over-abstract an already well-abstracted design, or refactor well-
working code to be more "testable" when it is already tested. All of those
should be opposed, but how do you draw the line between over-engineering and
hard-to-quantify quality increases?

~~~
dahart
> how do you draw the line between over-engineering and hard-to-quantify
> quality increases?

That is a really good question. That is _the_ question, IMO. I don't know the
answer, and if I did I'm pretty sure I could make a lot of money doing it.
It's a balance, maybe the right thing to do is just make sure you have forces
in both directions, and don't let one force dominate for too long. Make sure
in your working groups there are always people asking if there's a user facing
metric, a real tangible reason or justification for the code you're planning
to write or optimize. And also make sure there are people who are good at
planning for the future and are good at coming up with user-facing
justifications for upcoming plans.

------
btilly
I have seen too many of these to keep track of.

But I will never forget once having to work with some Windows software written
by someone else. He decided to parse a CSV file by reading the whole thing
into memory, and then literally chopping off one character at a time, copying
the rest of the string each time. It made what should have been an
instantaneous import of a fairly small file into a 20-minute ordeal. Simply
reading in one line at a time would have been a massive speedup.
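
The described anti-pattern is easy to reproduce. A minimal sketch (using a toy
no-quoting CSV dialect, purely to illustrate the copying cost) of the
chop-one-character approach next to the line-at-a-time version:

```python
def parse_quadratic(text):
    """Mimics the described approach: chop one character at a time,
    copying the rest of the string on every step -- O(n^2) total copying."""
    rows, row, field = [], [], []
    while text:
        ch, text = text[0], text[1:]  # the slice copies the remaining string
        if ch == ',':
            row.append(''.join(field)); field = []
        elif ch == '\n':
            row.append(''.join(field)); field = []
            rows.append(row); row = []
        else:
            field.append(ch)
    if field or row:
        row.append(''.join(field)); rows.append(row)
    return rows

def parse_linear(text):
    """One line at a time: O(n) overall for this simple dialect."""
    return [line.split(',') for line in text.splitlines()]

sample = "a,b,c\n1,2,3\n"
assert parse_quadratic(sample) == parse_linear(sample) == [['a', 'b', 'c'], ['1', '2', '3']]
```

Both give the same rows; only the copying behavior differs, which is exactly
why the bug was invisible on tiny test files.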

I did not have access to change it. I asked him to fix it. I begged him to fix
it. His response was to shrug and say, "Who cares? This runs as part of a
batch process with nobody in front of it. It works fine." That may be, but _I_
cared because _I_ had to sit through it repeatedly while debugging certain
misbehavior that arose in said batch process.

Years later I was talking to an ex-coworker who stayed at said company after I
left. He reported that that original developer after I left had to make a fix
to said batch process. While doing so he had fixed the performance bug and was
an instant hero among a number of other people who had to sit through the same
import repeatedly over time.

Yeah, it turns out that he had made the fix I'd originally asked for. And
nobody realized that he had caused the problem originally AND refused to fix
it.

That's one developer that I'm glad I will never have to work with again...

~~~
dahart
That would be pretty frustrating and annoying!

This reminds me of a story I heard about a lead game programmer who made a
habit of pre-allocating an unused megabyte at the beginning of a project. The
last week before ship, when the game would crash out of memory often and
everyone was freaking out and trying to squeeze every last byte out of their
code, he'd comment the allocation nobody had noticed, free up a megabyte, and
be the hero that made the game fit.

~~~
btilly
That one is famous.

But in the version that I heard, the lead programmer didn't do it to be a
hero. He did that because he knew that it is easier to not waste memory up
front than it is to figure out where you can save it at the end. Therefore he
made sure that memory became an obvious problem early on so that people were
careful about it the rest of the way through development.

In other words, _because_ he did that, the game hit memory targets that it
never could have hit had people waited until they tried to squeeze everything
in at the last moment and only then found that they were out of memory.

------
wolfgang42
Quote from the linked article by John Carmack[1]:

> The fly-by-wire flight software for the Saab Gripen (a lightweight fighter)
> went a step further... Control flow went forward only. Sometimes one piece
> of code had to leave a note for a later piece telling it what to do, but
> this worked out well for testing: all data was allocated statically, and
> monitoring those variables gave a clear picture of most everything the
> software was doing.... No bug has ever been found in the “released for
> flight” versions of that code.

When I started writing Arduino code, I independently re-invented this
technique. I write my code with only fixed-size loops and no delay() calls
(basically cooperative multitasking), making heavy use of finite-state
machines to keep track of what should happen when.

This style of code takes some getting used to, but its advantages are
enormous: I see other people using delays and busy-loops and then struggling
to get interrupts working to keep their device responsive, whereas I have a
clean linear set of logic that can be easily understood and is free of strange
race conditions and the like, and is _guaranteed_ to keep responding to
events.
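
A rough sketch of the pattern in Python (the real Arduino code would be C++
keyed off millis(); the names here are illustrative): every task is a small
state machine that gets a quick, non-blocking update() each pass through the
main loop, and nothing ever calls delay().

```python
class Blinker:
    """Toggle an LED every `period` ticks without ever blocking."""
    def __init__(self, period):
        self.period = period
        self.next_toggle = period
        self.led_on = False

    def update(self, now):            # called once per pass through the main loop
        if now >= self.next_toggle:   # time's up: flip state, schedule the next flip
            self.led_on = not self.led_on
            self.next_toggle = now + self.period

blinker = Blinker(period=10)
for tick in range(25):                # the main loop: each task gets a quick update()
    blinker.update(tick)
    # other_task.update(tick) would go here -- nothing ever sleeps
```

Because update() always returns immediately, the loop stays responsive no
matter how many such tasks are added.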

This style also makes it easy to literally eyeball performance: attach a
bicolor LED (one that has e.g. red and green LEDs in the same package) and
tell it to toggle colors once per loop. If the main loop is taking more than 8
milliseconds or so the LED will start to flicker visibly; if something pauses
the LED will go solid and it's obvious the main loop is frozen.

[1] [http://number-
none.com/blow/blog/programming/2014/09/26/carm...](http://number-
none.com/blow/blog/programming/2014/09/26/carmack-on-inlined-code.html)

(Edits for clarity and writing style.)

~~~
RangerScience
I would like to see more of this style of code (specifically in the Arduino
setting), as I'm having trouble seeing what you're up to from here. Is there
some place I can read more - ideally, actual code?

~~~
wolfgang42
Most of the code I've written either can't be released for privacy reasons (I
don't feel like scrubbing out the info of the person I wrote it for) or is
weird one-off experimental code that's a pain to explain.

However, I did find one standalone project that illustrates the point quite
elegantly; I've just written up some documentation and posted it on
[https://github.com/wolfgang42/iDropper-
arduino](https://github.com/wolfgang42/iDropper-arduino)

This code cheats a _little_ bit in that it has some delay loops in it; this is
because the timings are finicky and it was a lot easier to write this way.
Notice, however, that the `idrop_loop()` function that drives everything still
specifies a four-millisecond maximum runtime, far under anything that I've
ever needed to worry about.

Let me know if you have any questions about this; I feel that some of the
explanation I wrote isn't quite as clear as it should be, and because the code
is so minimal (it's really more of a library than a full project) it doesn't
illustrate the point quite as well as I'd like.

~~~
dahart
I don't have any examples to share, but I'll add that the thinking behind this
style is sometimes codified by style guides for projects that require high
levels of code safety. The best example I can think of is the famous "Power of
Ten" rules for code safety published by NASA. Follow these rules, and the
byproduct will be the kind of code we're talking about.

[http://pixelscommander.com/wp-
content/uploads/2014/12/P10.pd...](http://pixelscommander.com/wp-
content/uploads/2014/12/P10.pdf)

------
rosshemsley
One of my favourite examples:

A friend once sped up some code 90% by changing a

list.size() == 0

to

list.empty()

yes, std::list::size() can be _linear_ in C++98.

[http://www.cplusplus.com/reference/list/list/size/](http://www.cplusplus.com/reference/list/list/size/)

~~~
logophobia
std::list is a linked list, so it makes sense that size() is linear. Only if
you store the size separately do you get constant-time behavior (which C++11
enforces).

Your friend probably could've sped his code up even more by using a
std::vector. Linked lists have very bad processor cache behavior (they're not
allocated in contiguous memory); there are not a lot of things a list does
better than a vector.

Only if you prepend a lot, or insert in the middle of the list, might it have
some advantages over a vector, and even then there are better alternatives.

~~~
sqeaky
I have timed it on several implementations and posted the results (mostly on
reddit.com/r/cpp), but until you get several thousand items std::vector is
just faster than std::list. Of course it varies from machine to machine; for
example, my 4th-gen i7 could insert about 8,000 items before list finally
matched the speed of inserting at the front of a std::vector, and my AMD 8150
(the Bulldozer thing way before Ryzen) was about 3,500.

I did similar tests comparing linear scans through a std::vector against
lookups in a std::map or std::unordered_map, with similar results: many
thousands of items are required before dumb vectors get slow.

My default rule is to use a std::array if the size is fixed or a std::vector
if the size changes, until I hit 1,000 items. Then I look at the complexity
notation of the operations I am doing and choose which containers to
benchmark. Then I throw a couple of microbenchmarks in the test suite and use
whatever that tells me will be faster in my case. Sometimes I even go so far
as to use ifdefs and typedefs to change the container for different builds.

In practice I use a lot of std::vectors and I sort them a lot. Sorted vectors
are very fast in the general case. In the few workloads I have with large
numbers of items, even with millions of items, I use std::sort and
std::binary_search with std::vector because they can be much faster than other
associative containers. You do need to use operator< intelligently and that
means either already passing in an overload for the comparator function or
making a simple container class that wraps these concepts the way that makes
sense for your workload.
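
The sorted-vector idiom translates to any language with a binary-search
primitive. A sketch in Python, using the bisect module as a stand-in for
std::sort plus std::binary_search:

```python
import bisect

# Keep a flat, contiguous, sorted array and binary-search it.
items = [37, 5, 88, 12, 64]
items.sort()                          # one O(n log n) pass up front

def contains(sorted_list, value):
    """O(log n) membership test, the moral equivalent of std::binary_search."""
    i = bisect.bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

assert contains(items, 12)
assert not contains(items, 13)
```

The flat array keeps lookups cache-friendly, which is the whole point of the
comparison with node-based associative containers.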

~~~
terrencecrowley
Amazing how often just replacing a std::vector with a gap-buffer vector
implementation will get you the benefits of vector storage combined with good
insert/delete behavior (presuming locality of such operations, which is often
the case, especially when doing something that scans through the array).

------
ptx
One recent example of this is Synapse, the Matrix messaging server.

Synapse was generally considered to be inefficient and a huge memory hog, so
they have been experimenting with various ways to rewrite it in Go, saying[1]
at the end of last year:

"Synapse is in a relatively stable state currently ... we’re starting to hit
some fundamental limitations of the architecture: ... python’s single-
threadedness and memory inefficiency; ...; the fact the app papers over SQL
problems by caching everything in RAM (resulting in synapse’s high RAM
requirements); ..."

But in January[2] it was discovered, during the investigation of another
problem, that the vast majority of the memory usage was due to a bug:

"It’s also revealed the root cause of why Synapse’s RAM usage is quite so bad
– it turns out that it actually idles at around 200MB with default caching,
but there’s a particular codepath which causes it to spike temporarily by 1GB
or so – and that RAM is then not released back to the OS."

I wonder if they would still have started the Go rewrite if it hadn't been for
that bug. (It seems the new implementation has a better architecture and is
substantially faster, so I guess it's still a good idea, but maybe not as
urgent as before?)

[1] [https://matrix.org/blog/2016/12/26/the-matrix-holiday-
specia...](https://matrix.org/blog/2016/12/26/the-matrix-holiday-
special-2016-edition/)

[2] [https://matrix.org/blog/2017/01/06/synapse-0-18-6-is-out-
ple...](https://matrix.org/blog/2017/01/06/synapse-0-18-6-is-out-please-
upgrade-especially-if-on-0-18-5/)

~~~
Namrog84
This is a great example of why you can't take perf-related issues intuitively
or at face value. Profiling and in-depth analysis are always needed. I wonder
if this could have been spotted earlier if a better analysis had been done.

------
bluetomcat
Back in 2001, Joel Spolsky coined "Schlemiel the Painter's algorithm" to
illustrate the same class of inefficient programming techniques:
[https://en.wikipedia.org/wiki/Joel_Spolsky#Schlemiel_the_Pai...](https://en.wikipedia.org/wiki/Joel_Spolsky#Schlemiel_the_Painter.27s_algorithm)

His particular example was with the standard C library function strcat where
each new invocation traverses the enlarged string over and over from the
beginning, in order to find the terminating NUL character and append the new
part.
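
A small Python model of the Schlemiel behavior (the scan counter is
illustrative, standing in for strcat's walk to the terminating NUL):

```python
def strcat_schlemiel(parts):
    """Append pieces strcat-style: rescan the whole buffer before each append."""
    buf = ""
    scans = 0
    for p in parts:
        scans += len(buf)   # strcat walks the existing string to find the end
        buf = buf + p       # then appends; total rescanning grows quadratically
    return buf, scans

def strcat_tracked(parts):
    """Keeping a pointer to the end (or using join) makes each append cheap."""
    return "".join(parts)

parts = ["ab"] * 100
slow, scans = strcat_schlemiel(parts)
assert slow == strcat_tracked(parts)
assert scans == 2 * sum(range(100))   # 0 + 2 + 4 + ...: quadratic rescanning
```

The output is identical either way; only the hidden rescanning work differs,
which is what makes the pattern so easy to ship by accident.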

------
bastijn
I'd define a special category for distributed-system performance bugs. The
fact that proper distributed tracing is not standard yet (especially outside
of cloud environments) is a real pain here. The Google white paper on Dapper
[0] and all its implementations [1][2][3] in one form or another are
appearing, but they usually target web environments. That means that for my
legacy stack, with multiple nodes connected via networks of different kinds, I
have to profile on multiple systems, aggregate different log types manually,
think about time syncs (not everything logs UTC...), and so on. To be honest,
we just reached the point where we are trying to get a European subsidiary
project rolling to address this problem. The time lost tracking down
unexplainable gaps between a request and the first displayed result in the app
amounts to weeks, and on multiple occasions months - just because we have no
easy way to profile our distributed ecosystem.

After identifying who is to blame: network, db nodes, processing nodes,
application logic, cache (invalidation); we can talk about fixing.

For fixing of course it would be great if you could extract a meta model of
the application and another of the hardware. Changes to those models and some
nice prediction of impact would be the ideal end goal. Or, if we are wishing
anyway, some nice DSE on those models to optimize would be even better.

[0]
[https://research.google.com/pubs/pub36356.html](https://research.google.com/pubs/pub36356.html)
[1] [http://opentracing.io/](http://opentracing.io/) [2]
[http://lightstep.com/](http://lightstep.com/) [3]
[https://aws.amazon.com/xray/](https://aws.amazon.com/xray/)

~~~
pritianka
OpenTracing has libraries for C++ (though it needs work tbh) and Java which
can be used for mobile. The idea is to cover your entire stack ...

~~~
bastijn
I know. But our stack is C#, WCF, and websockets, with C++ where performance
is required, and some REST API/WPF/HTML5/TypeScript/JavaScript UI/clients.

The stack is something grown the past 15 or so years (medical business). The
existing options are not mature enough yet for us to be used, or not available
yet on our platforms.

------
glangdale
Occasionally I like the idea of producing some sort of dynamic instrumentation
code (I privately think of it as "shitgrind") that detects dynamic code that
is doing redundant work and flags it. Of course, this doesn't tell you that
the code is useless (i.e. I might have to initialize something that
subsequently isn't used down some deep and not-anticipatable-at-compile-time
set of branches), but it might provide hints that I'm doing a lot of writes
that never match up with reads, or that the writes/calculations that I'm doing
are going into places that already have that value. It would be incredibly
slow but there are plenty of uses I can imagine where you could detect
performance problems in a toy version of the problem (maybe running 100-1000x
slower) that are the same problems that you would have in the real version of
the problem at full scale. Would be a fun project, IMO.
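
A toy sketch of the idea, tracking reads and writes in a shadow map (all names
here are hypothetical; a real tool would instrument at the instruction level,
e.g. on top of Valgrind, at the 100-1000x cost described above):

```python
class TrackedMemory:
    """Records reads/writes so redundant work can be flagged afterwards."""
    def __init__(self):
        self.mem = {}
        self.written_not_read = set()   # writes with no subsequent read
        self.silent_stores = 0          # writes of a value already present

    def write(self, addr, value):
        if self.mem.get(addr) == value:
            self.silent_stores += 1     # store that changed nothing
        self.mem[addr] = value
        self.written_not_read.add(addr)

    def read(self, addr):
        self.written_not_read.discard(addr)
        return self.mem[addr]

m = TrackedMemory()
m.write("a", 1)
m.write("a", 1)        # redundant: stores the value already there
m.write("b", 2)        # never read afterwards
assert m.read("a") == 1
assert m.silent_stores == 1
assert m.written_not_read == {"b"}
```

As noted above, a flagged write isn't proof of useless code (the read may lurk
down a rare branch), so the output is hints, not verdicts.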

~~~
AstralStorm
Assembly instruction level profiler? Sounds like an extension to callgrind.

~~~
glangdale
Well, not just instruction profiling, but profiling of effects and tracing of
all reads/writes. Very, very expensive.

------
unsoundInput
Even though I agree that >>insert method of std::map is roughly 7 times slower
than it should be<< is bad, these kinds of problems are not too hard to find
and solve if they are actually problematic for your software.

The most problematic performance issues I've come across were usually
bad/premature optimizations that were not (correctly) validated against a
simpler implementation as a performance baseline. Things like parallelism
(multi-threading, web workers) or caching can absolutely tank performance if
not done correctly. Plus they usually tend to make stuff more complex and bug-
prone.

------
pjc50
This is one of the few areas where adding more levels of abstraction cannot
help you fix the problem. It's remarkably easy for some library function or
low level of the program to have O(n) performance that ends up getting called
in a loop with the same input or output.

~~~
wtetzner
> This is one of the few areas where adding more levels of abstraction cannot
> help you fix the problem.

I'm not sure that's true. It is of course true that adding levels of
abstraction often degrades performance, but I don't think it's necessary.

In fact, programming languages themselves are abstractions. It's pretty well
agreed at this point that a good C compiler will generate faster code than
someone writing assembly. It's of course _possible_ to write code that's at
least as fast in assembly, but in practice, nobody would do it, because the
program would be unmaintainable.

You can make abstractions fast, and one of the benefits of doing so is that
you can make your abstraction be the _easy_ way of doing things. That means,
by picking the easiest solutions, the people on your team are _also_ using an
efficient solution.

~~~
dahart
> It's pretty well agreed at this point that a good C compiler will generate
> faster code than someone writing assembly.

I've honestly never heard that before. If you'd said "a good C compiler will
generate code faster than someone writing assembly," then I'd agree... ;)

~~~
bluetomcat
How many die-hard assembly programmers today would do a better job in register
allocation, instruction scheduling based on the latency of each individual
instruction, instruction selection to pick the instructions with the minimum
encoded length, avoiding pipeline hazards or doing inlining decisions? These
are hard optimization problems that require some real CPU juice to get right.
An additional bonus is that the compiler does these really quickly for you and
you don't have to worry about the instruction scheduling each time you
refactor the code, for example.

~~~
gens
Funny thing is that one of the bigger problems with optimizing compilers _is_
register allocation. Not only is it difficult to write a good algorithm for
it, but compilers don't always know what variable is most important to keep in
a register. Gcc even implemented a weaker allocator [0] as the older one was
too complex to decipher problems with it.

Instruction scheduling is not a problem on modern CPUs. Not that it's hard
either way; just reorder things around memory accesses and heavy instructions
like division. Agner Fog made a nice table of instruction latencies and
throughputs [1].

Instruction encoding length... do you use 16-bit integers or do you just int
everything? The compiler has to do what you tell it to do. Sure, there are
plenty of tricks around that, but those are not the ones that give the
compiler a remarkable advantage over humans. Tricks like using bitwise
operations instead of multiplication, and shortening equations, are the
advantage that the compiler has. But even that is not that big of a deal.
(Again, do you use the restrict keyword? Or static/define for constants? Do
you know that C treats every "compilation unit" (.c file) as a separate
program?)

[0]
[https://gcc.gnu.org/wiki/cauldron2012?action=AttachFile&do=g...](https://gcc.gnu.org/wiki/cauldron2012?action=AttachFile&do=get&target=Local_Register_Allocator_Project_Detail.pdf)

[1] [http://www.agner.org/optimize/](http://www.agner.org/optimize/)

~~~
AstralStorm
I wonder if there is a resource for ARM CPUs like Agner's...

------
ericb
I'm open sourcing a Java Selenium library with page performance assertions
built in as part of my startup. It will let you do things like:

    test.config.setSLAMaxTime(2000); // set time in MS
    assertSLAMet(); // add this to any page

It will also allow you to assert on specific URLs and window timings. You can
then use these assertions to pass/fail builds. I also have some other nifty
ideas on preventing performance regressions that I'm building into this
(smart benchmarking vs. previous runs).

If anyone is interested, I'll be announcing it here when we open source it:

[http://signup.browserup.com/](http://signup.browserup.com/)

~~~
orf
> What if your Selenium tests had performance asserts for CI/CD

Isn't this just... a single function? Like, how the heck do you build a
startup around that. We do this as part of our Python selenium tests, I didn't
realize it wasn't built in to be honest.

~~~
ericb
There's a lot more to it--this is just a tiny part we're giving away, not
something we'll make money on.

We do real-browser load testing in the cloud using the same scripts folks use
for integration testing. Because they can also run locally, you can run them
alongside your integration tests with each commit, and you always know you
have working scripts. Scripts are easy to code and maintain because they can
use page objects like functional tests.

This means you can have versioned load test/monitoring scripts in your project
that always work with a particular sha. When you deploy, you can start
transactional monitoring immediately. It also means load testing is not a
separate stage in your development lifecycle.

------
kazinator
It's only a bug if the stakeholders that surround the program agree that there
should be (or already is) in place a performance _requirement_ which is not
being met.

Opportunities to uncover performance that could be improved and impose
requirements against it are out there, lurking unseen. Only problem is you
have to convince people to care on a case-by-case basis.

------
tcopeland
At one point I was working on a utility to detect suboptimal sequences of
method calls; the canonical example is using "[1,2,3].select {|x| x > 1
}.first" rather than "[1,2,3].detect {|x| x > 1}". These can be performance
issues, although the bigger win is in readability and communicating developer
intent. More details and examples here if you're interested:

[https://thomasleecopeland.com/2014/10/22/finding-
suboptimal-...](https://thomasleecopeland.com/2014/10/22/finding-suboptimal-
api-usage.html)

I haven't worked on it much recently though because the problems it found
weren't that significant. But I like the idea of runtime analysis for finding
issues, especially in a dynamically-typed language.
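
The same distinction exists in most languages. A Python sketch of the Ruby
pair, where the generator version stops as soon as the first match is found:

```python
xs = list(range(1, 1_000_001))

# select {...}.first: builds the entire filtered list before taking one element
first_slow = [x for x in xs if x > 1][0]

# detect {...}: a generator expression stops at the first match
first_fast = next(x for x in xs if x > 1)

assert first_slow == first_fast == 2
```

Both return the same value; only the amount of work done behind the scenes
differs.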

------
bitwize
An internet points out the obvious -- that performance matters in computing.
Hackernews nods in agreement, then goes back to coding Ruby on Rails
applications in an editor implemented in HTML5 and JavaScript that consumes
13% CPU just to make the cursor blink.

------
luckydude
"It takes an experienced programmer, with a reasonably accurate mental model
of the problem and the correct solution, to know how fast the operation should
have been performed, and hence if the program is running slower than it should
be."

I love this sort of approach to performance. If you know the hardware, the
software, and the application, you can predict how fast it should go. That's
surprisingly rare, in my experience, sadly. About the time I wrote lmbench I
was sitting in a lot of meetings at Sun where people were saying "we should do
this" and I could do a mental estimate of how fast it would go (I memorized
most of the stuff that lmbench measured so I know how many packets/sec we
could do, how many iops, how much memory bandwidth we had, etc). You'd be
amazed at the number of times "architects" were proposing to build something
that couldn't possibly work.

It's pretty systems-y and not that common in programmers, what with today's
frameworks and all. But there is still some value in being able to predict how
fast something should run. I love it when I see people doing that with the
needed knowledge to be accurate, very fun to watch.

------
songlinhai
I did several research projects on performance bugs during my phd.
[https://songlh.github.io/](https://songlh.github.io/)

------
radarsat1
Reminds me of
[https://accidentallyquadratic.tumblr.com/](https://accidentallyquadratic.tumblr.com/)

------
makecheck
Performance sort of has a paradoxical "coffee test": something that takes a
few seconds is aggravating and you will _wait_ for it but something that
routinely takes 5 minutes will make you go get coffee or switch tasks
completely.

With enough alternate tasks, waits are absorbed. Oddly, overall _throughput_
can be worse with sporadic tiny delays that are technically indicative of a
fast task.

You don't just want fast operations, you want programs that can be smart about
gathering unavoidable delays into chunks. This can even pay dividends when
consuming resources, as the gathered steps may use less (e.g. laptop wakes
dormant hardware just once to perform several grouped tasks, instead of
repeatedly over a short time frame).

------
base698
In practice, the majority of web-app performance problems have more to do with
the latency of IO. People seem to forget that 4ms may seem fast to humans, but
it's glacial to computers. Add 100 4ms calls to the DB and all of a sudden
your code is waiting for almost half a second if it does nothing else.
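
A back-of-the-envelope model of that arithmetic, assuming a batched query
(e.g. a single WHERE id IN (...) round trip) pays the 4ms latency once:

```python
LATENCY_MS = 4
N_CALLS = 100

sequential_ms = N_CALLS * LATENCY_MS   # one query per row, issued one at a time
batched_ms = LATENCY_MS                # one round trip for the whole batch

assert sequential_ms == 400            # 0.4s gone to nothing but waiting
assert batched_ms == 4
```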

~~~
AstralStorm
Bonus points: in a real real-time system (anything with a GUI) you might have
only 1 ms or less to handle an event in a bigger stream, e.g. dragging. This
is why so much UI is, or should be, really asynchronous.

------
glangdale
We run into this a fair bit of the time with Hyperscan. One particularly
noxious anti-pattern that can arise is the "prefilter that isn't" anti-
pattern: specifically, a multiple stage pipeline where the initial stage is
meant to filter out some portion of false positives when feeding to a more
expensive check, but isn't doing its job.

The painful thing is that the expensive check downstream winds up doing
everything, but the code still "works" because the expensive check is the only
necessary part (well, as long as the pre-filter isn't totally hosed and isn't
producing false negatives).

------
partycoder
Performance issues come in all flavors...

- CPU usage (complexity, cache misses...)

- Memory usage (leaked resources like memory, connections and handles)

- Excessive I/O (networking, files...)

- Concurrency (contention, starvation, deadlocks, livelocks...)

There's no one size fits all profiler. I find load testing in combination with
profiling very useful to get an idea of how a system performs.

Usually I keep injecting load until the system fails then find what failed.

There are other approaches, like automated performance testing (writing tests
with assertions on performance). e.g:
[https://scalameter.github.io](https://scalameter.github.io)

------
marknadal
Oh boy, it becomes far worse when you are dealing with javascript and web
apps. Pretty much all modern-day "best practices" are not best practices at
all, but bullet-hole leaks of performance.

I spent about a year researching this and put together my findings into this
(too short) 20 minute tech talk: [https://youtu.be/BEqH-
oZ4UXI](https://youtu.be/BEqH-oZ4UXI) hopefully others will find it useful
too.

Or just never use JS ;) but it is possible to make it fast.

------
prohor
The fun begins when you need to find performance issues in production, where
you cannot really use a profiler. Then you need to jump into APM tools.
Unfortunately it seems there's nothing really at that level for free, but take
this for example: [https://www.dynatrace.com/blog/code-level-visibility-for-
nod...](https://www.dynatrace.com/blog/code-level-visibility-for-node-js/)

~~~
whyever
Can't you just use something like performance counters, like perf for Linux?

~~~
itsderek23
In my experience, it's very difficult to tie profiling data from generic
profilers to specific requests, then to the specific lines-of-code triggering
the problems.

This is important because many performance conditions don't reveal themselves
all of the time: for example, it's very common that an issue might only be a
problem for your largest customers. The context is really important.

Scout has a production-safe profiler for Ruby apps that builds on the
wonderful StackProf gem that does this:
[http://help.apm.scoutapp.com/#scoutprof](http://help.apm.scoutapp.com/#scoutprof)

------
smnscu
Already down. edit: up again now

[https://webcache.googleusercontent.com/search?q=cache:o4ftZJ...](https://webcache.googleusercontent.com/search?q=cache:o4ftZJqeuhsJ:https://www.forwardscattering.org/post/49+&cd=1&hl=en&ct=clnk&gl=ro)

~~~
Ono-Sendai
Oh the irony :) Will try to fix.

Edit: Back up, let's see how long it lasts now :)

~~~
yetihehe
Less than 15min ;)

Edit: back up after further 10min. Seems like it moves in waves.

------
kevindqc
> when I was working on the Doom 3 BFG edition release, the exactly predicted
> off-by-one-frame-of-latency input sampling happened, and very nearly shipped

What is this predicted off-by-one-frame-of-latency input sampling ?

------
amorphid
If performance matters, benchmarking the heck out of everything helps quite a
bit! I like the style of benchmarks that focus on comparing iterations per
second.

------
nateberkopec
Personal blog authors - if you insist on not using Medium/similar big box
blogging service, just host your blog as flat files on S3. It will never, ever
go down when you hit number 1 on HN (as this poor author's site has).

~~~
skarap
It wouldn't go down with flat files even if you hosted it on an Arduino
connected to your phone's tethered wifi.

~~~
toast0
How many TLS handshakes per second can an Arduino manage? HTTPS everywhere
isn't cheap.

------
maximilianroos
Ironic that the site is down...

------
draw_down
"Should" is a tricky word.

